Between stimulus and response there is a
space. In that space is our power to choose 
our response. In our response lies our growth 
and our freedom. 


—Viktor E. Frankl 


Preface 


Sonic Interaction Design (SID) is the study and exploitation of sound as one of the
principal channels conveying information, meaning, esthetic, and emotional qualities
in interactive contexts. The field of Sonic Interactions in Virtual Environments
(SIVE) extends SID to immersive media, i.e., virtual/augmented/mixed reality (XR).
Considering a virtuality continuum, this book mainly focuses on virtual reality (VR),
occasionally also addressing mixed and hybrid reality settings.

The basic and most obvious assumption that motivates this volume is: it is hard 
to live in a world without sound and it is hard in virtual environments (VE) too. VR 
without plausible and convincing sounds feels unnatural to users. Auditory infor- 
mation is a powerful omnidirectional source of learning for our interaction in real 
and virtual environments. The good news brought by this book is that VR finally 
sounds plausible. Advances in several fields now make it possible to provide an immersive
listening experience that is perceptually indistinguishable from reality, which means
that immersive sounds could make interaction intrinsically natural. Auralization and
spatial audio technologies play a fundamental role in providing immersion and pres- 
ence in VR applications at an unprecedented level. The combination of recent devel- 
opments in VR headsets and earables further strengthens the perceptual validity of 
multimodal virtual environments and experiences. 

We can therefore promote a true audio-centered and audio-first design for VR 
with levels of realism and immersiveness that can even surpass the visual counterpart. 
Visuals, although rightly emphasized by many studies and products, are often not very 
effectively enhanced and strengthened by sound. The final result is a weakening of 
multisensory integration and the corresponding VR potentials that strongly determine 
the quality and durability of the experience. 

The editors would like to identify two starting points in the past 10 years that have
given rise to, and raised awareness of, the SIVE research area. The first episode
is symbolic: we would like to anecdotally bring back from our memories the first 
meeting between us, the two editors of the book. The year was 2011, exactly 10 years 
ago. Michele had recently started his Ph.D. at the Sound and Music Computing 
Group of the Department of Information Engineering at the University of Padua, 
under the supervision of Dr. Avanzini. The Italian Association of Musical Informatics 
(AIMI) organized the workshop “Sound and Music Computing for Human-Computer
Interaction” at the ninth edition of the Biannual Conference of the Italian ACM
SIGCHI Chapter (CHItaly), held in beautiful Alghero, Sardinia, in early September:
a great time of year for the seaside.

Michele was asked to write his first conference paper, to be presented at the
workshop and entitled “Customized 3D Sound for Innovative Interaction Design”: an
article with a high-sounding title that promises a lot but provides little; in short,
an article not to be proud of. On the other hand, there were some valuable references
to the egocentric audio perspective that will be formalized in the introductory chapter
of this book. However, the reason why we tell this anecdote is that, at the first
presentation at a scientific conference by the Ph.D. student Michele Geronazzo,
Dr. Stefania Serafin was in the very small audience. Ten years ago, we began to discuss
issues that connected sonic interaction design with immersive 3D audio in VR. The
AIMI president at that time failed to get the workshop’s contributions included in
the official ACM CHItaly proceedings despite a regular peer-review process. The
poor Ph.D. student Michele found himself without an official publication, at his first
conference, in an unknown scientific community. We like to think that something much
more relevant and impactful started at that event and with that meeting: SIVE.
We are here to give it a shape in this book, edited and structured together.

Another temporal coincidence brings us to connect this story with the second and 
official starting point of this adventure. Michele’s unpublished conference paper was 
finally published within his doctoral thesis, defended in 2014, the year in which the 
IEEE Virtual Reality workshop series “Sonic Interactions in Virtual Environments 
(SIVE)” started (https://sive.create.aau.dk/). The mission of IEEE VR SIVE was to 
increase among the virtual reality community and junior researchers the awareness 
of the importance of sonic elements when designing immersive XR environments. 
However, we can also identify a certain degree of reciprocity when considering
the fragmented nature and specificity of those studies that aim at developing immersive
XR environments for sound and music. We therefore refer first to our beloved
Sound and Music Computing (SMC) network, and then we consider the International
Community for Auditory Display (ICAD), the Audio Engineering Society
(AES), and the communities linked to the International Conference on New Interfaces
for Musical Expression (NIME), to Digital Audio Effects (DAFX), and to the
Sonic Interaction Design COST Action (COST-SID IC601, ended in 2012). All these
communities address aspects of the SIVE topics according to their specificities. No
institutional or contextual reference that collects technological developments, best
practices, and creative efforts related to the peculiarities of immersive VEs existed
before the SIVE workshop. The book follows a similar philosophy, trying to give
an exhaustive view of those multidisciplinary topics already mentioned in our two
recent reviews.¹ It features state-of-the-art research on real-time auralization, sonic


1 S. Serafin, M. Geronazzo, N. C. Nilsson, C. Erkut, and R. Nordahl, “Sonic interactions in virtual
reality: state of the art, current challenges and future directions,” IEEE Computer Graphics and
Applications, vol. 38, no. 2, pp. 31-43, 2018.

S. Serafin et al., “Reflections from five years of Sonic Interactions in Virtual Environments 
workshops,” Journal of New Music Research, vol. 49, no. 1, pp. 24-34, Jan. 2020. 




interaction design in VR, quality of the experience in multimodal environments, and 
applications. We aim to provide an organized starting point on which to develop 
a new generation of immersive experiences and applications. Since the editors are
aware of the very fast social transformation driven by the accelerating development
of digital technologies, all chapters should be read as entry points. Future scenarios
and solutions will necessarily evolve by combining emerging research areas such as 
artificial intelligence, ubiquitous and pervasive computing, quantum technologies, 
as well as continuous discoveries in the neuroscientific field and anthropological 
reflections on the authenticity of the experience in VR. 

For this reason, contributing authors and editors include interdisciplinary experts
from the fields of computer science, engineering, acoustics, psychology, design,
humanities, and beyond, so that we can give the reader a broad view of and a clear
introduction to the state-of-the-art technologies and design principles, and to the
challenges that might be awaiting us in the future.

Through an overview of emerging topics, theories, methods, tools, and practices 
in sonic interactions in virtual environments research, the book aims to establish the 
basis for further development of this new research area. The authors were invited to 
contribute to specific topics according to their well-known expertise. They followed 
a predefined structure outlined by the editors. 


The book is divided into four parts: 

Part I, Introduction: this theoretical part frames the background and the key 
themes in SIVE. The editors address several phenomenological foundational issues 
intending to shape a new research field from an archipelago of studies scattered in 
different research communities. 

Part II, Interactive and Immersive Audio: we cover the system requirement 
part with four chapters introducing and analyzing audio-related technological aspects 
and challenges. With some overlaps and connections, the four chapters deal with the 
plausibility of an immersive rendering able to tackle the computational burden. To do 
so, we deal with methods and algorithms for real-time rendering considering sound 
production, propagation, and spatialization, respectively. Finally, the reproduction 
and evaluation phase allows closing the development loop of new audio technologies. 

Part III, Sonic Interactions: a sonic interaction design part devoted to emphasizing
the peculiar aspects of sound in immersive media. In particular, spatial interactions
are important wherever we would like to produce and transform ideas and actions
to create meaning with VR, and the virtual auditory space is an information
container that could be shaped by users. As VR systems enter people’s
lives, manufacturers, developers, and creators should carefully consider an embodied
experience ready to share a common space with peers, collaboratively.

Part IV, Sonic Experiences: the last part focuses on multimodal integration 
for sonic experiences in VR with the help of several case studies. Starting from a 
literature review of multimodal experiments and experiences with sound, this last 
part offers some reflections on the concept of audio-visual immersion and audio- 
haptic integration able to form our ecology of everyday or musical sounds. Finally, 
the potential of VR to transport artists and spectators into a world of imagination and
unprecedented expression is taken as an exemplar of what multimodal and immersive
experiences can elicit in terms of emotional and rational engagement.

In the following, a summary for each chapter is provided to help the reader to 
follow the proposed narrative structure. 


Part I 


Chapter 1 illustrates the editors’ vision of the SIVE research field. The main concept 
introduced here is the egocentric audio perspective in a technologically mediated 
environment. The listeners should be entangled with their auditory digital twins in 
a participatory and enacted exploration for sense-making characterized by a person- 
alized and multisensory first-person spatial reference frame. Intra-actions between 
humans and non-human agents/actors dynamically and fluidly determine immersion 
and coherence of the experience, participatively. SID aims to facilitate the diffraction 
of knowledge in different tasks and contexts. 


Part II 


Chapter 2 addresses the first building block of SIVE, i.e., the modeling and synthesis 
of sound sources, focusing on procedural approaches. Special emphasis is placed on 
physics-based sound synthesis methods and their potential for improved interactivity 
concerning the sense of presence and embodiment of a user in a virtual environment. 

In Chap. 3, critical challenges in auralization systems in virtual reality and games 
are identified, including progressing from modeling enclosures to complex, general 
scenes such as a city block with both indoor and outdoor areas. The authors provide 
a general overview of real-time auralization systems, their historical design and 
motivations, and how novel systems have been designed to tackle the new challenges. 

Chapter 4 deals with the concepts of adaptation in a binaural audio context, consid- 
ering first the adaptation of the rendering system to the acoustic and perceptual prop- 
erties of the user, and second the adaptation of the user to the rendering quality of the 
system. The authors introduce the topics of head-related transfer function (HRTF) 
selection (system-to-user adaptation) and HRTF accommodation (user-to-system 
adaptation). 

Finally, Chap. 5 concludes the second part of the book by introducing audio 
reproduction techniques for virtual reality, the concepts of audio quality, and quality 
of the experience in VR. 


Part III


Chapter 6 opens the third part of the book devoted to SID within virtual environments. 
In particular, it deals with space, a fundamental feature of VR systems, and more 
generally, human experience. In this chapter, the authors propose a typology of VR 
interactive audio systems, focusing on the function of systems and the role of space 
in their design. Spatial categories are proposed to be able to analyze the role of space 
within existing interactive audio VR products. 

Chapter 7 promotes the following great opportunities offered by VR systems: to 
bring experiences, technologies, and users’ physical and experiential bodies (soma) 
together, and to study and teach these open-ended relationships of enaction and 
meaning-making in the framework of soma design. In this chapter, the authors 
introduce soma design and focus on design exemplars that come from physical 
rehabilitation applied to sonic interaction strategies. 

Then, Chap. 8 investigates how to design the user experience without being detri- 
mental to the creative output, and how to design spatial configurations to support 
both individual creativity and collaboration. The authors examine user experience 
design for collaborative music-making in shared virtual environments, giving design 
implications for the auditory information and the collaborative facilitation. 

Finally, Chap. 9 explores the possibilities in content creation, such as spatial music
mixing, be it in virtual spaces or for surround sound in film and music, offered
by the development of VR systems and multimodal simulations. The authors present
some design aspects for mixing in VR, investigating existing virtual music mixing
products, and creating a framework for a virtual spatial-music mixing tool.


Part IV 


Chapter 10 helps the reader to understand how sound enhances, substitutes, or modi- 
fies the way we perceive and interact with the world. This is an important element 
when designing interactive multimodal experiences. In this chapter, Stefania presents 
an overview of sound in a multimodal context, ranging from basic experiments in 
multimodal perception to more advanced interactive experiences. 

Chapter 11 focuses on audiovisual experiences, by discussing the idea of immer- 
sion, and by providing an experimental paradigm that can be used for assessing 
immersion. The authors highlight the factors that can influence immersion and they 
differentiate immersion from the quality of experience (QoE). The theoretical impli- 
cations for conducting experiments on these aspects are presented, and the authors 
provide a case study for subjective evaluation after assessing the merits and demerits 
of subjective and objective measures. 

Chapter 12 focuses on audio-haptic experiences and is concerned with the effects of
haptic augmentations on auditory perception, for example, how
different vibrotactile cues may affect the perceived sound quality. The authors
review the results of different experiments showing that the auditory and somatosensory
channels together can produce constructive effects resulting in a measurable
perceptual enhancement.

Finally, Chap. 13 examines the special case of virtual music experiences, with 
particular emphasis on the performance with Immersive Virtual Musical Instruments 
(IVMI) and the relation between musicians and spectators. The authors assess in 
detail the several technical and conceptual challenges linked to the composition 
of IVMI performances on stage (i.e., their scenography), providing a new critical 
perspective. 


We hope the reader finds this book informative and useful for both research and
practice with sound.


Udine, Copenhagen Michele Geronazzo 
September 2021 Stefania Serafin 


Part I
Introduction 


Chapter 1
Sonic Interactions in Virtual
Environments: The Egocentric Audio 
Perspective of the Digital Twin 


Michele Geronazzo and Stefania Serafin 


Abstract The relationships between the listener, physical world, and virtual envi- 
ronment (VE) should not only inspire the design of natural multimodal interfaces 
but should be discovered to make sense of the mediating action of VR technologies. 
This chapter aims to transform an archipelago of studies related to sonic interactions 
in virtual environments (SIVE) into a research field equipped with a first theoret- 
ical framework with an inclusive vision of the challenges to come: the egocentric 
perspective of the auditory digital twin. In a VE with immersive audio technolo- 
gies implemented, the role of VR simulations must be enacted by a participatory 
exploration of sense-making in a network of human and non-human agents, called 
actors. The guardian of such locus of agency is the auditory digital twin that fosters 
intra-actions between humans and technology, dynamically and fluidly redefining 
all those configurations that are crucial for an immersive and coherent experience. 
The idea of entanglement theory is here mainly articulated from an egocentric spatial
perspective related to emerging knowledge of the listener’s perceptual capabilities.
This is an actively transformative relation with the digital twin’s potential to create
movement, transparency, and provocative activities in VEs. The chapter contains an
original theoretical perspective complemented by several bibliographical references 
and links to the other book chapters that have contributed significantly to the proposal 
presented here. 


1.1 Introduction


Our daily auditory experience is characterized by immersion from the very beginning 
of our life inside the womb, actively listening to sounds surrounding us from different 
positions in space. Auditory information takes the form of a binaural continuous 
stream of messages to the left and right ears, conveying a compact representation of 
the omnidirectional source of learning for our existence [19, 48]. Both temporal and 
spatial activity of sounds of interest (e.g., dialogues, alarms, etc.) allow us to localize 
and encode the contextual information and intentions of our social interaction [1]. 

The hypothesis that our daily listening experience of sounding objects with certain
physical characteristics dynamically shapes the acoustic features through which we
ascribe meaning to our auditory world is supported by one of the key concepts
in Husserl’s phenomenology, “meaning-bestowal” (“Sinngebung” in German [73]),
and by studies in ecological acoustics such as [48, 54, 96]. In particular, the idea of
acoustic invariant as a complex pattern of change for a real-world sound interaction is 
strongly related to human perceptual learning and a socio-cultural mediation dictated 
by the real world. For surveys of classical studies on ecological
acoustics, see [112].

From this perspective, acoustic invariants are learned on an individual basis 
through experiential learning. Hence, there is the need to trace their development 
over multiple experiences and to formalize a common ground for a dynamic expan- 
sion of individual knowledge. Any emerging understanding should be transferred to 
a technological system able to provide an immersive and interactive simulation of 
a sonic virtual environment (VE). Such a process must be adaptive and dynamic to 
ensure a level of coupling between user and technology in such a way that the active 
listening experience is considered authentic. 

Immersive virtual reality technologies (here generically referred to as VR)
allow immense flexibility and increasing possibilities for the creation of VEs with 
relationships or interactions that might be ontologically relevant even if radically 
different from the physical world. This can be evident by referring to the distinction 
between naturalistic and magical interactions, where the latter can be considered 
observable system configurations in the domain of artificial illusions, incredibly 
expanding the spectrum of possible digital experiences [13, 127]. 

One of the main research topics in the VR and multimedia communities is ren- 
dering. For decades, computer-aided design applications have favored—in the first 
place—the development of computer graphics algorithms. Some of these approaches, 
e.g., geometric ray-tracing methods, have been adapted to model sound propagation 
in complex VEs (see Chap. 3 for more details). However, there has been a clear 
tendency to prioritize resources and research on the visual side of virtual reality, 
confining auditory information to a secondary and ancillary role [158]. Although 
sound is an essential component of the grammar of digital immersion, relatively 
little compared to the visual side of things has been done to investigate the role of 
auditory space and environments. Nowadays, there is increasing consensus on
the essential contribution of spatial sound, also in VR simulations [9, 102, 145].
Technologies for spatial audio rendering are now able to convey perceptually plausible
simulations with stimuli that are reconstructed from real-life recordings [18] or
historical archives, as for the Cathédrale Notre-Dame de Paris before and after the 
2019 fire [79], getting closer to a virtual version indistinguishable from the natural 
reality [77]. This is made possible by a high level of personalization in modeling 
user morphology and acoustic transformations caused by the human body interact- 
ing with the sound field generated in room acoustic computer simulations [17, 78, 
114]. 

Nowadays, the boundary between technology and humans has increasingly 
blurred thanks to recent developments in research areas such as virtual and aug- 
mented reality, artificial intelligence, cyber-physical systems, and neuro-implants. 
It is not possible to easily distinguish where the human ends and the technology 
begins. For this reason, we embrace the idea of [10] who sees technology as a lens 
for the understanding of what it means to be human in a changing world. We can 
therefore consider the phenomenal transparency [94] where technology takes on the 
role of a transparent mediator for self-knowledge. According to Loomis [88], the 
phenomenology of presence between physical and virtual environments places the 
internal listener representation created by the spatial senses and the brain on the same 
level. Human-technology-reality relations are thus created by enactivity that allows 
a fluid and dynamic entanglement of all the involved actors. 

In this chapter, we initially adopt Slater’s definition of presence for an immersive 
VR system [135] embracing the recent revision by Skarbez [134]. The concepts of 
plausibility illusion and place illusion are central to capturing the subjective internal 
states. While the plausibility illusion determines the overall credibility of a VE in 
terms of subjective expectations, the place illusion establishes the quality of having 
sensations of being in a real place. They are both fundamental in providing credibility
to a digital simulation based on individual experience and expectations concerning
an internal frame of reference for scenes, environments, and events.¹

We propose a theoretical framework for the new field of study, namely Sonic Inter- 
actions in Virtual Environments (SIVE). We suggest from now on a unified reading 
of this chapter with references and integrations from all chapters of the correspond- 
ing book [49]. Each chapter provides state-of-the-art challenges and case studies 
for specific SIVE-related topics curated by internationally renowned scientists and 
their collaborators. The provided point of view focuses on the relations between real 
auditory experience and technologically mediated experiences in immersive VR. 
The first is characterized by individuality to confer immersiveness within a physical 
world. It is important to emphasize the omnidirectionality of auditory information
that allows the listener to collect both the whole and the parts at 360°. The individualized
auditory signals are the result of the acoustic transformations made by
the head, ears, and torso of the listener that act as a spatial fingerprint for a complex
spatio-temporal signal. Familiarity, and therefore previous experience with sounds,
shapes spatial localization capabilities with high intersubjectivity. Finally, studies on


1 For a dedicated discussion on the basic notions related to presence, please refer also to Chap. 11
in this volume. 
neural plasticity of the human brain confirm continuous adaptability of listening with
impaired physiological functions, e.g., a hearing loss, and with electrical stimulation, 
e.g., via cochlear implants [82]. 
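
The acoustic transformations made by the listener's head, ears, and torso, described above as a spatial fingerprint, are commonly modeled as a pair of head-related impulse responses (HRIRs), one per ear, that filter each sound source. The following is only a minimal sketch of that idea and not a method taken from this book: the function names, the 48 kHz sampling rate, and the toy HRIRs (a pure interaural delay and attenuation) are assumptions chosen for illustration, whereas real HRIRs are measured or simulated for each listener and direction.

```python
import numpy as np

def toy_hrir_pair(itd_samples, ild_gain, length=64):
    """Hypothetical toy HRIRs for a source on the listener's left.

    The near (left) ear receives a unit impulse; the far (right) ear
    receives the same impulse delayed by `itd_samples` and scaled by
    `ild_gain`. Measured HRIRs would also encode pinna and torso filtering.
    """
    near = np.zeros(length)
    far = np.zeros(length)
    near[0] = 1.0
    far[itd_samples] = ild_gain
    return near, far

def render_binaural(mono, hrir_left, hrir_right):
    """Filter a mono source with a left/right HRIR pair (full convolution)."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])  # shape: (2, len(mono) + len(hrir) - 1)

fs = 48_000                              # assumed sampling rate in Hz
rng = np.random.default_rng(0)
source = rng.standard_normal(fs // 10)   # 100 ms noise burst

# Assumed cues: 30 samples (about 0.6 ms at 48 kHz) of delay and a gain
# of 0.5 at the far ear.
hrir_l, hrir_r = toy_hrir_pair(itd_samples=30, ild_gain=0.5)
binaural = render_binaural(source, hrir_l, hrir_r)
```

Replacing the toy pair with HRIRs selected or measured for a specific listener is what turns this generic filter into an individualized one.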

The mediated VR experience is often characterized by the user’s digital counterpart,
called an avatar. It allows the creation of an embodied and situated experience
in digital VEs. The scientific literature supports the idea that the manipulation of
VR simulations can induce changes at the cognitive level [124], such as positive
educational [34] and therapeutic [106] effects. The ability of VR technologies to
mediate within the immersive environment in embodied and situated relations gives
immersive technologies the opportunity to change one’s self [151].

For these reasons, we believe it is time to coin, at the terminological level, a new 
perspective that relates the two listening experiences (i.e., real and virtual), called 
egocentric audio perspective. In particular, we refer to the term audio to identify 
an auditory sensory component, implicitly recalling those technologies capable of 
immersive and interactive rendering. The term egocentric refers to the perceptual 
reference system for the acquisition of multisensory information in immersive VR 
technologies as well as the sense of subjectivity and perceptual/cognitive individ- 
uality that shape the self, identity, or consciousness. In accordance with Husserl’s 
phenomenology, the human body can be philosophically defined as a “Leib”, a living 
body, and a “Nullpunkt”, a zero-point of reference and orientation [73]. 

This perspective aims to extend the discipline of Sonic Interaction Design [44] by 
taking into account not only the importance of sound as the main channel conveying
information, meaning, aesthetic, and emotional qualities, but also an egocentric
perspective of entanglement between the perceiving subject and the computer simulating
the perceived environment. In the first instance, this can be described by processes of
personalization, adaptation, and mutual relations to maintain the immersive illusion.
However, in this chapter we argue that it is much more than that. We hope
that our vision will guide the development of new immersive audio technologies and 
conscious use of sound design within VEs. 

The starting point of this theoretical framework is an ecologically egocentric per- 
spective. The foundational phenomenological assumption considers a self-propelled 
entity with agency and intentionality [47]. It can interact with the VE being aware 
of its activities in a three-dimensional space. The active immersion in a simulated 
acoustic field provides it with meaningful experiences through sound.

Therefore, it is important to introduce a terminological characterization of
the listener (not a user, in this context) as a human being with prior experience
and subjective auditory perception. A closely related entity is the auditory digital
twin, which differs from the more common avatar. The idea of an avatar within a
digital simulation co-located with objects, places, and other avatars [126] requires a
user taking control of some form of virtual body, which might be noticeably different
from the listener’s physical body. On the other hand, the digital twin
cannot disregard an egocentric perspective of the listener for whom it is created. This 
means that the relations with the VEs should consider personalization techniques on 
the virtual body closely linked to the listener’s biological body. This mediation is 
essential for the interactions between the listener and all the diegetic sounds, whether
they are produced by the avatar’s gestures or by sound sources in the VE. 

In such a context, immersiveness is a dynamic relationship between physical 
and meaningful actions by the listener in the VE. Specifically, performing
bodily practices such as walking, sitting, talking, and grasping provides meaning to
virtual places, objects, and avatars [59]. Accordingly, the sense of embodiment can
be considered a subjective internal feeling which is an expression of the relationship 
between one’s self and such VE. In this regard, Kilteni et al. [80] identified the 
sense of embodiment for an artificial body (i.e., avatar) in the mediation between the 
avatar’s properties and their processing by the user’s biological properties. 

We now introduce the technological mediation in the form of an auditory digital 
twin which is a guardian and facilitator of (i) the sense of self-location, (ii) plau- 
sibility, (iii) body ownership, and (iv) agency for the listener. In the first instance, 
a performative view might make us see realities as “a doing”, enacting practical
actions [6, 104]. Similarly, the listener and the avatar cannot be considered fixed 
and independent interacting entities, but constituent parts of emergent, multiple and 
dynamic phenomena resulting from entangled social, cognitive, and perceptual ele- 
ments. This intra-systemic action of entangled elements dynamically constructs 
identities and properties of the immersive listening experience. The illusory perma- 
nence of auditory immersion lies in the boundaries between situationally entangled 
elements in fluid and dynamic situations. They can be seen as confrontations occur- 
ring exactly in the auditory digital twin that facilitates the phenomenon. The auditory 
digital twin is the meeting and shared place between the listener and a virtual body 
identity, communicating in a non-discursive (performative) way according to the 
quality level of the digital simulation. 

In an immersive VE, the listeners cannot exist without their auditory digital twin 
and vice versa. Through the digital twin characterization, the acoustic signals gen- 
erated by the VE are filtered exclusively for the listeners, according to their ability 
to extract meaningful information. It is worthwhile to mention the participatory
nature of such an entanglement process between listener and digital twin, as a joint
exploration of the listener’s attentional process in selecting meaningful information,
e.g., the cocktail party effect [20]. We might speculate by considering a simulation
that interacts within the digital twin to provide the best pattern or to discover it in 
order to attract the listener’s attention. The decision-making process will then be the 
result of intra-action in and of the auditory digital twin. 

This chapter has three main sections. Section 1.2 gathers the different strands that
characterize the research and artistic works in SIVE. Section 1.3 holds a central
position by defining the constitutive elements of our proposed egocentric audio perspective
in SIVE: spatial centrality and entanglement between human and computer
in the digital twin. In Sect. 1.4, we attempt to incorporate this theoretical framework
by adapting Milgram and Kishino’s well-known taxonomy for VR [95], with an
audio-first perspective. Finally, Sect. 1.5 concludes this chapter by encouraging a
new starting point for SIVE. We suggest an inclusive approach to the next paradigm
shift in the discipline of human-computer interaction (HCI).


1.2 SIVE: From an Archipelago to a Research Field


This chapter aims to offer an interpretation of an archipelago of research from
different communities, such as


• Sound and Music Computing (SMC) network, a point of convergence for different
research disciplines mainly related to digital processing of musical information.²

• International Community for Auditory Display (ICAD), a point of convergence
for different areas of research with digital processing of non-musical audio information
and the idea of sonification in common.³

• The Audio Engineering Society (AES), the main community for institutions and
companies devoted to the world of audio technologies.⁴

• The research community gathered by the International Conference on New Interfaces
for Musical Expression (NIME), devoted to interactions with new interfaces
with the aim of facilitating the human creative process.⁵

• The Digital Audio Effects community (DAFX) aiming at designing technology-based
simulations of sonic phenomena.⁶


We employ here the metaphor of an archipelago because it aptly describes a context
in which all these communities address aspects of VR according to their specificities,
influencing each other. After all, they share the same “waters”. They are relatively
close to each other but feel distant from the VR community at the same time, like
the islands of an archipelago in the open sea. Thus, we affirm the need to unify
these fragmentary and specialized studies and to fill the gap with their visual
counterparts, aiming at developing immersive VR environments for sound and music.
To achieve this goal, the editors have pursued the following spontaneous path,
characterized by three main steps.


1. The first review article related to SIVE topics, dating back to 2018 [128], focused
on the technological components characterizing an immersive potential for inter- 
active sound environments. In that work, the editors and their collaborators pro- 
duced a first compact survey including sound synthesis, propagation, rendering, 
and reproduction with a focus on the ongoing development of headphone tech- 
nologies. 

2. Two years later, we published a second review paper together with all the organiz- 
ers of the past five editions of the IEEE Virtual Reality’s SIVE workshop [129]. 
In this paper, we analyzed the contributions presented at the various editions 
highlighting the emerging aspects of interaction design, presence, and evalua- 
tion. An inductive approach was adopted, supported by a posteriori analysis of 
the characterizing categories of SIVE so far. 


3. Finally, this book and, in particular, this chapter want to raise the bar further
with an organic and structured narrative of an emerging discipline. We aim to
provide a theoretical framework for interpreting and accompanying the evolution
of SIVE, focusing on the close relationship between physically real and virtual
auditory experiences described in terms of immersive, coherent, and entangled
features.


2 https://smcnetwork.org/
3 https://icad.org/
4 https://www.aes.org/
5 https://nime.org
6 https://www.dafx.de/

Fig. 1.1 The SIVE inverse pyramid. Arrows indicate high-level relational hierarchies


This chapter is the result of the convergence of two complementary analytical
strategies: (i) a top-down approach describing the structure that the editors gave to
the book, which originated from the editors’ own research experience, and
(ii) a bottom-up approach drawing on the knowledgeable insights of the contributing
authors of this book on several specialist and interdisciplinary aspects. Consequently,
we will constantly refer to these chapters in an attempt to provide a unified and long-term
vision for SIVE.

Our proposal for the definition of a new research field starts from a simple layer
structure without claiming to be exhaustive. The graphical representation in Fig. 1.1
gives an overview and a rough picture of the inter-relations among the many disciplines
involved in SIVE. We suggest a hierarchical structure for the various disciplines in
the form of an inverted pyramid. SIVE research can be conceptually
organized in three levels:


i Immersive audio concerns the computational aspects of the acoustical-space
properties of technologies. It involves the study of acoustic, psychoacoustic,
computational, and algorithmic representations of auditory information,
and the development of enabling audio technologies;

ii Sonic interaction refers to human-computer interplay through auditory feedback
in 3D environments. It comprises the study of vibroacoustic information
and its interaction with the user to provide abstract meanings and specific indicators
of the state of a process or activity in interactive contexts;

iii The integration of immersive audio in multimodal VR/AR systems impacts
different application domains. This third and final level collects all the studies
regarding the integration of virtual environments in different application domains
such as rehabilitation, health, psychology, and music, to name but a few.


The immersive audio layer is a strongly characterizing element of SIVE. For this
reason, it is placed at the tip of the inverse pyramid, where all SIVE development
opportunities originate. In other words, SIVE cannot exist without sound spatializa- 
tion technologies, and the research built upon them is intrinsically conditioned by the 
level of technological development (for more arguments on this issue see Sect. 1.3.2). 

In particular, spatial audio rendering through headphones involves the computa- 
tion of binaural room impulse responses (BRIRs) to capture/render sound sources in 
space (see Fig. 1.2). BRIRs can be separated into two distinct components: the room 
impulse response (RIR), which defines room acoustic properties, and the head-related 
impulse response (HRIR) or head-related transfer function (HRTF, i.e., the HRIR in 
the frequency domain), which acoustically describes the individual contributions of 
the listener’s head, pinna, torso, and shoulders. The former describes the acoustic
space and environment, while the latter encodes this information into perceptually
relevant spatial acoustic cues for the auditory system, taking advantage of the flexibility
of immersive binaural synthesis through headphones and state-of-the-art
consumer head-mounted displays (HMDs) for VR. The perceptually coherent aural- 
ization with lifelike acoustic phenomena, taking into account the effects of near-field 
acoustics and listener specificity in user and headphones acoustics, is a key techno- 
logical matter here [11, 21, 68]. 
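The decomposition described above can be illustrated with a minimal sketch. It assumes NumPy, uses toy impulse responses with hypothetical values rather than measured data, and applies the same HRIR pair to every room reflection, which is a strong simplification of direction-dependent rendering. The function names are illustrative and do not come from any library discussed in this book.

```python
import numpy as np


def hrir_to_hrtf(hrir: np.ndarray, n_fft: int = 256) -> np.ndarray:
    """Return the HRTF as the one-sided spectrum of a (zero-padded) HRIR."""
    return np.fft.rfft(hrir, n=n_fft)


def render_binaural(source: np.ndarray,
                    brir_left: np.ndarray,
                    brir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono source with a left/right BRIR pair; returns shape (2, N)."""
    left = np.convolve(source, brir_left)
    right = np.convolve(source, brir_right)
    return np.stack([left, right])


# Toy impulse responses (hypothetical values, not measured data).
rir = np.zeros(64)
rir[0] = 1.0      # direct sound
rir[40] = 0.5     # a single room reflection
hrir_left = np.zeros(32)
hrir_left[2] = 1.0     # near ear: earlier and louder
hrir_right = np.zeros(32)
hrir_right[10] = 0.6   # far ear: later and quieter

# Simplifying assumption: one HRIR pair is applied to all room reflections.
brir_left = np.convolve(rir, hrir_left)
brir_right = np.convolve(rir, hrir_right)

source = np.zeros(16)
source[0] = 1.0   # unit impulse used as a test signal
binaural = render_binaural(source, brir_left, brir_right)
hrtf_left = hrir_to_hrtf(hrir_left)
```

In a real renderer each reflection would be filtered with the HRIR of its own direction of arrival, and the headphone response (HpIR) would be compensated as well.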


Fig. 1.2 High-level acoustic components for immersive audio with a focus on spatial room acoustics
and headphone reproduction. [Diagram labels: Binaural Room Impulse Response (BRIR); Spatial Room
Impulse Response (SRIR); Head-Related Impulse Response (HRIR); Headphone Impulse Response (HpIR);
Room Acoustics; Headphones; Listener’s body]

The visual component of spatial immersion is so evident that it may seem that the
sensation of immersion is exclusively dependent on it, but the aural aspect has as much 
or even more relevance. We can simulate an interactive listening experience within 
VR using standard components such as headsets, digital signal processors (DSPs), 
inertial sensors, and handheld controllers. Immersive audio technologies have the 
potential to revolutionize the way we interact socially within VR environments and 
applications. Users can navigate immersive content employing head motions and 
translations in 3D space with 6 degrees of freedom (DoF). When immersive audi- 
tory feedback is provided in an ecologically valid interactive multisensory experi- 
ence, a perceptually plausible scheme for developing sonic interactions is practically 
convenient [128], yet still efficient in computational power, memory, and latency 
(refer to Chap. 3 for further details). The trade-off between accuracy and plausibility 
is complex and finding algorithms that can parameterize sound rendering remains 
challenging [62]. The creation of an immersive sonic experience requires 


• Action sounds: sounds produced by the listener that change with movement,

• Environmental sounds: sounds produced by objects in the environment, referred
to as soundscapes,

• Sound propagation: acoustic simulation of the space, i.e., room acoustics,

• Binaural rendering: user-specific acoustics that provide cues for auditory localization.


These are the virtual acoustics and auralization key elements [153] at the basis of 
auditory feedback design that draws on user attention and enhances the sensation of 
place and space in virtual reality scenarios [102]. 

The two upper layers of the SIVE inverse pyramid, i.e., sonic interactions and 
multimodal experiences, are not clearly distinguishable and we propose the following 
interpretation: we differentiate the interaction from the experience layer when we 
intend to extrapolate design rules for the sonic component with a different meaning 
for the designer, system, users, etc. In both cases, embodiment and proprioception
are essential, naturally supporting multimodality in the VR presence. This makes
generalization difficult, a difficulty that is well grounded in our egocentric
audio perspective. In our proposed theoretical framework, the hierarchies initially
identified can change dynamically.

Ernst and Bülthoff’s theory [41] suggests how our brain combines and merges
different sources of sensory information. The authors described two main strategies:
sensory combination and integration. The former aims at maximizing the information
extraction from each modality in a non-redundant manner. The latter aims
at finding congruence and reducing variability in the redundant sensory information
in search of greater perceptual reliability. Both strategies consider a bottom-up
approach to sensory integration. In particular, the concept of dominance is associ- 
ated with perceptual reliability from each specific sensory modality given the specific 
stimulus. This means that the main research challenge for SIVE is not only to foster 
research aimed at understanding how humans process information from different 
sensory channels (psychophysics and neuroscience domains), but especially how 
multimodal VEs should distribute the information load to obtain the best experience
for each individual. Accordingly, we assume that each listener has personal
optimization strategies to extract meaning from redundant sensory information distributions.
VR technology can improve only if it can establish a sort of dialogue
with the listener to understand such a natural mixture of information.
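The chapter gives no formula for sensory integration. A common formalization in the cue-combination literature, assuming independent Gaussian noise on each cue, is reliability-weighted averaging, where each weight is proportional to the inverse variance of its cue. The sketch below illustrates that idea only; the function name and the numerical values are hypothetical.

```python
def integrate_cues(estimates, variances):
    """Reliability-weighted average of redundant cues (weight ∝ 1 / variance)."""
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_variance = 1.0 / total  # never larger than the smallest input variance
    return combined, combined_variance, weights


# Hypothetical azimuth estimates (degrees) of one source from vision and audition.
visual_estimate, visual_variance = 10.0, 1.0
auditory_estimate, auditory_variance = 14.0, 4.0

azimuth, variance, weights = integrate_cues(
    [visual_estimate, auditory_estimate],
    [visual_variance, auditory_variance],
)
```

With these values the more reliable visual cue receives the larger weight, which is one way to read the notion of dominance, and the combined variance is lower than that of either cue alone. Listener-specific strategies would correspond to listener-specific variances.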

The design process of multimodal VEs must also constantly take into account the 
limitations, i.e., the characterization, of the VR technologies with the aim of creating
real-time interactions with the listener. According to Pai [108], interaction models 
can be described as a trade-off between accuracy and responsiveness. Increasing the 
descriptive power and thus the accuracy of a model for a certain phenomenon leads to 
processing more information before providing an output in response to a parametric 
configuration. It comes at the price of higher latency for the system. For multisensory 
models that should synchronize different sensory channels, this is crucial and has to 
be carefully balanced with many other concurrent goals. 
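As a toy illustration of this trade-off (not a model taken from Pai [108]), the sketch below picks the most accurate rendering configuration whose latency fits a given budget and falls back to the most responsive one when nothing fits. The configuration names, accuracy scores, and latency figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderConfig:
    name: str
    accuracy: float     # hypothetical quality score, higher is better
    latency_ms: float   # hypothetical processing latency in milliseconds


def pick_config(configs, latency_budget_ms):
    """Return the most accurate configuration that meets the latency budget."""
    feasible = [c for c in configs if c.latency_ms <= latency_budget_ms]
    if not feasible:
        # Nothing fits the budget: fall back to the most responsive model.
        return min(configs, key=lambda c: c.latency_ms)
    return max(feasible, key=lambda c: c.accuracy)


CONFIGS = [
    RenderConfig("direct sound only", 0.3, 2.0),
    RenderConfig("early reflections", 0.6, 8.0),
    RenderConfig("full room simulation", 0.9, 40.0),
]

chosen = pick_config(CONFIGS, latency_budget_ms=20.0)
```

A multisensory system would additionally have to keep the chosen audio latency consistent with the visual and haptic channels, which this sketch does not attempt.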

Understanding interactions between humans and their everyday physical world 
should not only inspire the design of natural multimodal interfaces but should be 
directly explored in VE models and simulation algorithms. This message is strongly
supported by Chap. 10 and our theoretical framework fully integrates this vision by 
trying to further extend this perspective to non-human agents. The role of the digital 
simulation and the computer behind it is participation and discovery for the listener. 
They constitute a complex system whose interactions contribute to the dynamic definition
of non-linear narratives and causal relationships that are crucial for immersive
experiences. The application contexts of the interactive simulations inform the
trade-off between the accuracy and the responsiveness of the models. Hence, the knowledge
of the perceptual-cognitive listener capabilities emerges as active transformations in 
multimodal digital VR experiences. 


1.3 Egocentric Audio 


A large body of research in computational acoustics has focused on the technical challenges
of quantitative accuracy characterizing engineering applications, simulations
for acoustic design, and treatment in concert halls. Such simulations are very expensive
in terms of computational resources and memory, so it is not surprising that
the central role of perception in rendering has gradually come into play. This has
motivated the search for lower bounds at which perceptually authentic audio-visual
renderings can still be achieved (see Chap. 5 for a more detailed discussion). Continuous knowledge
exchange between psychophysical research and interactive algorithm development
makes it possible to test new hypotheses and propose responsive VR solutions. It is worthwhile
to mention the topic of artificial reverberations and modeling of the reverberation 
time aiming to provide a sense of presence through the main spatial qualities of a 
room, e.g., its size [83, 147]. 

In the context of SIVE, we could review and adapt the three paradigm shifts, or 
“waves” in HCI mentioned by Harrison [64], which still coexist and are at the center 
of research agendas for different scientific communities. The first wave considers 
the optimization of interaction in terms of the human factor in an engineered system.
We could mention as an example the ergonomic, but generic, “one fits all” solutions
of dummy-heads and binaural microphones for capturing acoustic scenes [110]. The 
second wave introduces a connection between man and machine in terms of infor- 
mation exchange, looking for similarities and common ground in decision-making 
processes, e.g., memory and cognition. The structural inclusion of non-linearities 
and auditory Just-Noticeable Differences (JNDs) to determine the amount of infor- 
mation to be encoded for gesture sonification is an example of this direction [38]. 
Finally, the third paradigm shift considers interaction as a situated, embodied, and 
social experience, characterized by emotions and complex relations encountered in 
everyday life. We could place here many of the case studies collected in this volume 
(Parts III and IV). In this regard, the extracted patterns or best practices are often
very specific to each study and listener group, e.g., musician vs. non-musician
(Chap. 9). 

From developments in phenomenological [93] and, more recently, 
post-phenomenological thinking [74, 150], we will therefore develop the egocentric
audio perspective. The key principle is the shift from interaction between
defined objects to intra-action within a phenomenon whose main actors are human
and non-human agents. Boundaries between actors are fluidly determined, similarly 
to the Gibsonian ecological theory of perception [54, 55]. Even though this is a 
shift from an anthropocentric and user-centered view toward a system of enactive 
relations and associations in the immersive world of sounds, we chose the term ego- 
centric to emphasize the spatial anchoring between humans and technology in the 
self-knowledge constitution. 

It is also useful to refer to the concept of ambiguity proposed by the philosopher
Maurice Merleau-Ponty, according to which all experiences are ambiguous: they are composed not of
things that have a defined, identifiable essence, but rather of open or flexible
styles or patterns of interactions and developments [93, 123]. Starting from an
egocentric spatial perspective of immersive VR, the learning and transformation 
processes of the listeners occur when their attention is guided toward external vir- 
tual sounds, e.g., the out-of-the-head and externalized stimuli. This allows them to 
achieve meaningful discoveries also for their auditory digital twins. Accordingly, the 
experience mediated by a non-self, i.e., auditory simulation of VEs, is shaped (i) by 
the past experience of the listener and the digital twin indistinctly acquired from a 
physical or cybernetic world in a constructivist sense, (ii) by the physical-acoustic 
imprinting induced or simulated by the body, head and ears, and (iii) by active 
and adaptive processes of perceptual re-learning [57, 160] induced by a symbiosis 
with technology. Figure 1.3 schematizes and simplifies this relationship between 
man-technology-world from which the listener acquires meaning. As pointed out by 
Vindenes and Wasson [151], experiences are mediated in a situated way from the 
subjectivity of the listener which constitutes herself in relation to the objectivity of 
the VE. Placing the physical and virtual worlds at the same level yields
similar internal representations for the listener and her digital twin, allowing us to
promote the transformative role of VR experiences for a human-reality relationship
altered after exposure.

Fig. 1.3 Technological mediation of the auditory digital twin (adapted from Hauser et al. [66])


The core of our framework is an ideal auditory digital twin: an essential mediator 
and existential mirror for an egocentric audio perspective. Technology is the mediator 
of this intentional relationship co-constituting both the listener and her being in the 
world. From this post-phenomenological perspective of SIVE, we are interested in
understanding, at the same time, how the VE relates to the listener and what the VE
means for the listener. Our main goal is to characterize the mediating
action between the listener and the VE by an auditory digital twin. This guardian can 
reveal the listener’s ongoing reconfiguration through the human-world relationship 
occurring outside the VR experience. 

In the remainder of this chapter, we will motivate the opportunity to refer to this 
non-human entity other than the self and aspiring to be the mediator for the self. This 
first philosophical excursus of hermeneutical nature allows us to take a forward- 
looking vision for the SIVE discipline, framing the current state of the art but also 
including the rapid technological developments and ethical challenges due to the 
digital transformation. 


1.3.1 Spatial Centrality 


The three-dimensionality of the action space is one of the founding characteristics of
immersive VEs. Considering such a space of transmission, propagation, and reception
of virtually simulated sounds, sonic experiences can assume different meanings and
open up many opportunities.

Immersive audio in VR can be reproduced both through headphones and loud- 
speaker arrays determining a differentiation between listener- and loudspeaker- 
centric perspectives. The latter seems to decentralize the listener role in favor of 
a strong correlation between virtual and physical (playback) space. In particular, 
sound in VEs is decoded for the specific loudspeaker arrangements in the physical
world (for a summary of the playback systems refer to Chap. 5). This setup
allows the coexistence of several listeners in the controlled playback space, depending
on the so-called sweet spot. However, the VE and the listener-avatar mapping are
intrinsically egocentric and multisensory, subordinating a loudspeaker-centric perspective
for the simulation of the auditory field to a listener-centric one. Let us try
to clarify this idea with a practical example: head movements and the navigation 
system, e.g., redirected walking [101], determine the spatial reference changes for 
the real/virtual environment mapping corresponding to the listener’s dynamic explo- 
ration. The tracking system could trigger certain algorithmic decisions to maintain 
the place and plausibility illusions of the immersive audio experience. 
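The practical example above can be sketched in its simplest form: when the tracker reports a head rotation, the renderer re-expresses a world-fixed source in head-centered coordinates so that the source stays put in the virtual scene. The function name and the sign convention (azimuth and yaw in degrees, positive in the same rotation direction) are assumptions made for this illustration, which covers only yaw and ignores translation and redirected walking.

```python
def relative_azimuth(source_azimuth_deg: float, head_yaw_deg: float) -> float:
    """Source azimuth in head-centered coordinates, wrapped to [-180, 180)."""
    return (source_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


# A source fixed at 30 degrees in the world frame, heard while the head turns.
facing_forward = relative_azimuth(30.0, 0.0)    # source is 30 degrees off-center
facing_source = relative_azimuth(30.0, 30.0)    # source is straight ahead
```

The resulting angle would then select or interpolate the HRTF pair used for rendering. A full 6-DoF system would update elevation and distance as well.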


1.3.1.1 First Person Point of View 


In this theoretical framework, we focus on the listener’s perspective, where sound is 
generated from the first-person point of view (generally referred to as 1PP). Virtual 
sounds are shaped by spatial hearing models: auralization takes into account the 
individual everyday listening experience both in physical-acoustic and non-acoustic 
terms. Contextual information relates spatial positions between sound events and
objects with the avatar virtual body, creating a sense of proximity and meaningful 
relations for the listener. 

It is relevant to stress the connection between the egocentric audio perspective
and the research field of egocentric vision, which has a history of more than twenty years.
The latter is a subfield of computer vision that involves the analysis of images and
videos captured by wearable cameras, e.g., Narrative Clips⁷ and GoPro⁸, considering
an approximation of the visual field due to a 1PP. From this source of information,
spatio-temporal visual features can be extracted to conduct various types of recognition
tasks, e.g., of objects or activities [100], and analysis of social interactions [2].
The egocentric audio perspective originates from the same 1PP in which both space
and time of events play a fundamental role in the analysis and synthesis of sonic inter- 
actions. Furthermore, we stress the idea that all hypotheses and evaluations in both 
egocentric vision and audition are individually shaped around a human actor. How- 
ever, our vision does not focus exclusively on the analysis of the listener behaviors 
but includes generative aspects thanks to the technological mediation of the spatial 
relations between humans and VEs (these aspects will be extensively discussed in 
Sect. 1.3.2). 

Using a simplification adopted in Chap. 2 concerning the work by Stock- 
burger [140] on sounds in video games, we can distinguish two categories for sound 
effects: (i) those related to the avatar’s movements and actions (e.g., footsteps, knock- 
ing on a door, clothing noises, etc.) and (ii) the remaining effects produced by the VE. 
In this simple distinction, it is important to note that all events are echoic, i.e., they 
produce delays and resonances imprinted by the spatial arrangements of the avatar-VE
configurations depending on the acoustical characteristics of the simulated space.
Moreover, all events should be interpreted by the listener’s memory, which is shaped
by the natural everyday reality.

7 http://getnarrative.com/
8 https://gopro.com/

Finally, it is worth noting that the egocentric 1PP poses novel challenges in
the field of cinematic VR narration or, more generally, of storytelling in VR. Gödde et
al. [56] identified immersive audio as an essential element able to capture attention
on events/objects outside the field of view. The distinction between the active role
of the listener interacting with the narrative and the passive role as an observer raises
interesting questions about the spatial and temporal positioning of scenic elements.
The balance between environment, action, and narration is delicate. Citing Gödde and
collaborators, one “can only follow a narrative sufficiently when temporal and spatial
story density are aligned with each other”. Hence, the spatio-temporal alignment of 
sound is crucial. 

For most researchers interested in sound, from the neurological to the aesthetic- 
communicative level, it is clear that while the visual object exists primarily in space, 
the auditory stimulus occurs in time. Therefore, it is not surprising that in order to 
speak of spatial centrality in audio we need to consider presence, the central attribute 
for a VR experience. In his support of a representational view of presence, Loomis [88]
cites two scientists with opposite opinions: William Warren and Pavel Zahorik,
the former an expert in visual VR and the latter in acoustic VR. The former supports
a rationalist view of representational realism and direct perception [154], while the
latter supports the ecological perspective in the fluidity in perception-action [159].⁹
The second perspective supports the concept of enaction such that it is impossible to 
separate perception from action in a systematic way. Perception is inherently active 
and reflexive in the self. Recalling Varela, another leading supporter of this perspec- 
tive [148], experience does not happen within the listener but is instead enacted by the 
listener by exploring the environment. Accordingly, we consider an embodied, envi- 
ronmentally situated perceiver where sensory and motor processes are inseparable 
from the exploratory action in space. At first glance, such a view restricts experi- 
ences to only those generated by specific motor skills which are in turn induced by 
biological, psychological, and cultural context. However, this is generally not true in a
digital-twin-driven VE (see Sect. 1.4.3).


1.3.1.2 Binaural Hearing 


The geometric and material features of the environment are constituent elements of
the virtual world that must be simulated in a plausible way for that specific listener.
First of all, the listener-environment coupling is unavoidable and must guarantee
sound localization performance good enough to maintain immersiveness. In particular,
it has to avoid the inside-the-head spatial collapse, i.e., when the virtual sound
stimuli are perceived inside the head, a condition opposite to the natural listening
experience of outside-the-head localization for surrounding sound sources, also
called externalization [131]. Externalization can be considered a necessary but not
sufficient condition for the place illusion, i.e., being immersed in that virtual acoustic
space. In a recent review of the literature on this topic, Best et al. [8] suggest that
ambient reverberation and sensorimotor contingencies are key indicators for eliciting
a sense of externalization, whereas HRTF personalization and consistent visual
information may reinforce the illusion under specific circumstances. However, the
intra-action between these factors is so complex that no univocal priority principles
can be applied. Accordingly, we should explore dynamic relations depending
on specific links between evolving states of the listener-VE system during the VR
experience. Moreover, huge individual differences in the perception of externalization
require in-depth exploration of several individual factors such as monaural
and binaural HRTF spectral features and temporal processes of adaptation [27, 65, 146].

9 Atherton and Wang [4] recently developed a similar viewpoint comparison and proposed a set of
design principles for VR, born from the contrast between “doing vs. being”.

Binaural audio and spatial hearing have been well-established research fields for 
more than 100 years and have received relevant contributions from information and 
communications technologies (ICT) and in particular from digital signal process- 
ing. Progress in digital simulations has made it possible to replicate with increasing 
accuracy the acoustic transformation by the body of a specific listener with very high 
spatial resolution up to sub-millimeter grids for the outer ear [113, 114]. This pro- 
cess generates acoustically personalized HRTFs so that the rendering of immersive 
audio matches the listener’s acoustic characterization (System-to-User adaptation 
in Chap. 4). On the opposite side, the VE can train and guide the listener in a pro- 
cess of User-to-System adaptation by designing ad-hoc procedures for continuous 
interaction with the VE to induce a persistent recalibration of the auditory system
to non-individual HRTFs.¹⁰ These two approaches can be considered two poles
between which one can define several mixed solutions. This dualism is thoroughly
presented and analyzed in Chap. 4.


1.3.1.3 Quality of the Mediated Experience 


Since our theoretical framework aims to go beyond user-centricity, we approach
the space issue from both the user and the technology perspectives.
However, all points of view remain ecologically anchored to the egocentric
1PP of the listener, giving rise to a fundamental question: how can we obtain high-quality
sonic interactions for a specific listener-technology relation? In principle,
many quality assessment procedures might be applied to immersive VR systems. 
However, there is no adequately in-depth knowledge of the technical-psychological- 
cognitive relationship regarding spatial hearing and multisensory integration pro- 
cesses linked to plausibility and technological mediation. 

On the other hand, a good level of standardization has been achieved for the perceptual
evaluation of audio systems. For instance, the ITU recommendations focus
on the technical properties of the system and signal processing algorithms. Chapter 5
introduces the Basic Audio Qualities used for telecommunications and audio codecs,
commonly adopted in the evaluation of spatial audio reproduction systems. The
evaluation of the listening experience quality, called Overall Listening
Experience [125], is also introduced, considering not only system technical
performance but also listeners’ expectations, personality, and current state.
All these factors influence the listening of specific audio content. A related measure
can be the level of audio detail (LOAD) [39] that attempts to manage the available
computational power, the variation of spatio-temporal auditory resolution in complex
scenes, and the perceptual outcome expected by the listener, in a dynamically
adaptive way.

10 The HRTF selection process can potentially result from a random choice [139].

Chapter 2 provides an original discussion on audio “quality scaling” in VR simu- 
lations, drawing the following conclusion: there is neither an unambiguous definition 
nor established models for such issues. It suggests that understanding the listener- 
simulation-playback relations is an open challenge, extremely relevant to SIVE. In 
general, the most commonly used approach is the differential diagnosis, allowing 
the qualities of VR systems to emerge from different quantitative and qualitative 
measurements. Several taxonomies for audio qualities or sound spatialization have 
given rise to several attribute collections, e.g., semantic analysis of expert surveys 
and expert focus groups (see Chap. 5 on this). It is worthwhile to mention that a substantial
body of research in VR is devoted to exploring the connections between VR
properties such as authenticity, immersion, sense of presence and neurophysiological 
measurements, e.g., electroencephalogram, electromyography, electrocardiogram, 
and behavioral measurements, e.g., reaction time, kinematic analysis. 

To summarize, this differentiation tries to capture all those factors that lead to a 
high level of presence: sensory plausibility, naturalness in the interactions, meaning 
and relevance of the scene, etc. Moreover, the sense of presence in VR will remain 
limited if the experience is irrelevant to the listener. If the listener-environment relation 
is weak, the mediating action of the immersive technology might result in a 
break in presence that can hardly be restored after a pause [136]. These cognitive 
illusions depend, for example, on the level of hearing training and on familiarity with a 
stimulus or sound environment. All these aspects reinforce the term egocentric again, 
grounding auditory information to a reference system that is naturally processed and 
interpreted in 1PP. However, SIVE challenges go far beyond two opposing points 
of view, i.e., user-centered and technology-centered. In this chapter, we offer a first 
attempt at a systemic interpretation of the phenomenon. 


1.3.2 Entanglement HCI 


Heidegger’s phenomenology aims to overcome mind-body dualism by introducing 
the notion of “Dasein” which requires an embodied mind to be in the world [67]. 
The concept of embodiment became central to the third wave of HCI, e.g., in rela- 
tion to mobile and tangible user interfaces [64]. More recently, the bodily element 
has been incorporated into the theoretical framework of somaesthetics to explain 
aesthetic experiences of interaction and into design principles for bodily interaction [71]. 
Designers are encouraged to participate with their lived, sentient, subjective, 
purposive bodies in the process of creating human-computer interactions, either 
by improving their design skills and sensibilities, or by providing an added value 
of aesthetic pleasure, lasting satisfaction, and enjoyment to users. These elements 
are summarized in Chap. 7, which provides a useful distinction of perspectives for 
interaction design: the first-person, second-person, and third-person design perspectives. 
The latter is equivalent to an observer approach to design that relies on 
common practices, e.g., interview administration, subjective evaluations, and 
analysis of data acquired from a variety of sensors. The second-person perspective is equivalent 
to the user-centered and co-design approach between the user’s perspective and the 
designer’s attempt to step into the shoes of someone else. On the other hand, soma 
design principles embrace a first-person perspective (we would argue egocentric), 
even for designers, who are actively involved with their bodies during each step of 
the interaction design process of an artifact or simulation. They explicitly become 
actors themselves with the result of shaping a felt and lived experience for other 
actors. 

In the movement computing work by Loke and Robertson [87], the authors intro- 
duced another perspective distinction relevant here. The mover (first-person perspec- 
tive) and the observer (third-person perspective) are explicitly joined by the machine 
perspective. The role of technology is pivotal for the interactions with digital move- 
ment information and, in particular, for the process of attributing meaning based on 
user input. This perspective requires mapping data from sensing technologies into 
meaningful representations for the observer and the mover. It is worthwhile to note 
that machines capture the qualities of movement with considerable losses in terms of 
spatial, temporal, or range resolution, making the comprehension of such limitations 
on interaction design essential. We need to explore the various perspectives, not in a 
mutually exclusive way, but dynamically managing the analysis of the various points 
of view in every immersive experience. 

According to Verbeek [150], human-world relations are enacted through technol- 
ogy. Thus, man and technology constitute themselves as actors in a fluid reconfig- 
uration. A practical example in the field of music perception considers a drummer 
who changes her latency perception the more she plays the musical instrument [86]. 
The action of playing the drum changes the relationships that she has with the instru- 
ment itself, with the self, and with temporal aspects of the world, e.g., reaction times 
and synchronizations. 

The recent proposal of a post-phenomenological framework by Vindenes [151] 
is based on Verbeek’s concept of technological mediation, which identifies several 
human-technology relationships including immersion in smart environments, ambi- 
ent intelligence, or persuasive technologies. In particular, for the latter case, VR 
plays a central role co-participating within a mixed intentionality between humans 
and technology. Accordingly, Verbeek introduced the idea of composite intentional- 
ity for cyborgs [149], a cooperation between human and technological intentionality 
with the aim to reveal a (virtual) reality that can only be experienced by technologies, 
by making accessible technological intentionalities to human intentionality. We 
can argue that the world and the technology become one in the immersive simulation 
that knows the listeners and actively interacts with them. This configuration 
becomes bidirectional: humans are directed toward technology and technology is 
directed toward them. Moreover, listeners have the opportunity to access reflective 
relationships with themselves through VEs. For example, Osimo et al. provided 
experience of the self through virtual body-swapping in the embodied perspective- 
taking [106]. We must decentralize humans as the sole source of activity and attribute 
to the material/technological world an active role in revealing new and unprecedented 
relational actions. 

This approach opens up new opportunities for “reflexive intentionality” of the 
human beings about themselves through the active relation with simulations [5]. 
In this regard, Verbeek [150] classifies the technological influence on humans according 
to two dimensions: visibility and strength. Some mediations can be hidden but 
induce strong limitations, while others can be manifest but have a weak impact on 
humans. There is a deep entanglement between humans and machines to the extent 
that there is no human experience that is not mediated through some kind of technol- 
ogy that shapes who we are and what we do in the world. Considering immersive VR 
technologies, we must speculate on what is a locus of agency: the understanding of 
the active contributions of each tool in the listener’s actions in VEs. Such an infras- 
tructure must be enactive and re-interpretive of each actor in each circumstance. In 
other words, there is the opportunity of becoming different actors depending on an 
active inter-dependence. 

At this point, it is useful to recall the work of Orlikowski [105] for two reasons. First, she gave the 
name of entanglement theories to those heterogeneous theories that have in common 
the recognition of the active inter-dependence between socio-technological-material 
configurations, with the consequence of promoting studies of humans and technology 
in a unitary way. Second, Orlikowski supported her position with an experimental 
example of social VR, Sun Microsystems’ Project Wonderland, which was developed more 
than a decade ago and nowadays seems more relevant than ever due to the COVID-19 
pandemic. We will analyze a similar case in SIVE, supporting our taxonomy in 
Sect. 1.4. In this section, we focus on entanglement theories that are foundational 
for our egocentric perspective. 

Entanglement is the deep connection between humans and their tools, with relevant 
repercussions in the field of human-computer interaction. In [45], Frauenberger 
provided the following interpretative key: we cannot design computers or interactions; 
we can work on facilitating certain configurations that enact certain phenomena. 
Both configurations and phenomena are situated and fluid, but not random. They 
are causally connected within hybrid networks in which human and non-human 
actors interact. However, it must be made clear that these actors do not possess fixed 
representations of their entities, but they exist only in their situated intra-action. This 
means that their relations and configurations are dynamically defined by the so-called 
agential cuts that draw the boundaries between entities during phenomena. In this 
network of associations, each configuration change is equivalent to a newly enacted 
phenomenon where new agential cuts are redefined or create new actors. Hence, the 
term agency refers to a performative mechanism of boundary definition and consti- 
tution of the self. Together with the post-phenomenological notion of technological 
mediation, entangled HCI provides a lens able to interpret the increasingly fuzzy 
boundaries between humans, machines, and their distribution of agency. 

The sonic information from intentional active listening is anchored to an ego- 
centric perspective of spatiality that allows the understanding of an acoustic scene 
transformed by the listener’s actions/movements. This process can be mathematically 
formalized with the active inference approach by Karl Friston and colleagues [46] and 
their recent enactive interpretation [115]. Their computational framework quantitatively 
integrates sensation and prediction through probabilistic and generative models 
following the so-called free-energy principle, i.e., the minimization of a 
free-energy function of beliefs and expectations. Following this line of thought both philosophically 
and mathematically, we argue that immersive audio technologies are capable 
of contributing to the listener’s internal representation in both spatial and semantic 
terms, eliciting a strong sense of presence in VR [12]. Just as we cannot clearly 
distinguish between listener and real environment, so we cannot clearly distinguish 
between listener and VE. 
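
For reference, the quantity minimized under the free-energy principle is commonly written in the active inference literature as the variational free energy below. The notation is introduced here for illustration and is not taken from this chapter: o denotes sensory observations (e.g., the acoustic input at the ears), s the hidden states of the environment, p the listener's generative model, and q the approximate posterior belief over hidden states.

```latex
F[q] \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
     \;=\; D_{\mathrm{KL}}\big[\,q(s)\,\|\,p(s \mid o)\,\big] \;-\; \ln p(o)
```

Since the Kullback-Leibler term is non-negative, F is an upper bound on the surprise, -ln p(o). Minimizing F with respect to q improves the belief about hidden states, while acting on the environment changes o itself, which is one way to read the role of the listener's movements in the passage above.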

Therefore, the sonic interaction design in VEs is an intra-action between technol- 
ogy, concepts, visions, designers, and listeners that produce certain configurations 
and agential cuts. According to the sociological actor-network theory [28, 85], the 
network of associations characterizes the ways in which materials join together to 
generate themselves. Prior knowledge also becomes an actor in such a network that 
shapes, constrains, enables, or promotes certain activities. For example, modeling the 
listener’s acoustic contribution with measurements from a dummy head induces a cut 
that shapes the use cases and VR experiences. Similarly, agential cuts are performed 
based on knowledge from other studies. For instance, the auditory feedback supports 
the plausibility of footstep synthesis or the strategies employed in the definition of 
time windows for synchronous and embodied sensory integration [122]. Moreover, 
the physical and design features of the technology also contribute to determining 
what is feasible: e.g., the differentiation of playback systems for spatial audio results 
in differentiation in the quality of the experience (see Chap. 11). 

In the entanglement within the relational network of listener-reality-simulation, 
configurations and actors are dynamically defined in a situated and embodied manner. 
In the process of configuring and reconfiguring actors, designing various aspects, 
and operating agential cuts, new knowledge is produced that causally links the 
enactment of the technological design to the phenomenon created [45]. This 
knowledge takes several forms, one of which resides in the technological artifact itself, 
i.e., in the VR simulation. In a more general sense, we could argue that exploring the 
evolution in the network configurations and actors enables an active search for the 
egocentrically meaningful experience. In line with this, agency and its responsibilities 
are not the prerogative of the listener or the technology but reside in their intra-actions. 


1.3.3 Auditory Digital Twin 


From entanglement theories, we inherit a series of open questions that guide our 
reflection on the SIVE research field. Let us consider the immersive VR simulation 
as the digital artifact co-defining itself with the listener who experiences it. 


How can certain transformative actions and interactions be programmed? 
Who/what is the mediator, if any, in the relationship between the physical world and the VE? 


How should such a mediator act? 


Of particular interest here is Schultze’s interpretation of the avatar [126]: a 
dynamic self-representation for the user, a form of situated presence that is variably 
implemented. Sometimes the avatar is seen as a separate entity, behaving indepen- 
dently of the user. Sometimes the listener inhabits the avatar, merging with it to such 
an extent that they feel completely immersed and present in the virtual space. From 
this variety of instances, definitions of identity (avatar vs. self), agency (technology 
vs. human), and the world (physical vs. virtual) are fluid and enacted depending on the 
situation. Moreover, we argue that avatars and listeners know very little about each 
other. Such consideration strengthens the individual experience that determines one 
tendency over the other (separation vs. union with an avatar) with difficult predictions 
and poorly generalizable interpretations. Consequently, the user characterization in 
human-centered design is somehow included here [76]. However, our view promotes 
meaningful human-technology relationships in a bidirectional manner: not only per- 
sonalized user experiences, but experiences able to shape who we really want 
to be. 

The communication between the avatar and the listener, the virtual and the physical 
is challenging. Considering the avatar as part of a VE configuration, we can formulate 
one of the initial questions: if we can handle mediation, where/who is in charge of 
that? 

Our performative perspective is questioning the a priori and fixed distinctions of 
certain representationalism between avatar and self, technology agency and listener, 
physical reality, and virtuality. These boundaries have to be drawn in situated and 
embodied action, which makes them dynamic and temporary. The exploration of how, 
when, and why agential cuts define boundaries of identity, agency, and environments 
is the core of our theoretical framework. 

We want to give a digital form to the philosophical question of the locus of 
agency: we envision a meta-environment with technological-digital nature, which 
is the guardian, careful observer, and lifeblood for the dialogue and participation of 
each actor. Its name is the auditory digital twin. In an egocentric perspective, it 
takes shape around the listener, i.e., the natural world that is meaningful to her. Why 
twin? Because this term recalls the idea of the deep connection between two different 
and distant entities or persons, commonly grounded by similarities, e.g., the DNA or 
a close friendship. Although the adjective auditory would seem to restrict our idea 
to the sound component, the framework ecologically extends to the multisensory 
domain by considering the intrinsic multisensory nature of VR. For these reasons, 
we will provide an audio-first perspective, sometimes sacrificing the term auditory 
in favor of a more readable and synthetic expression without loss of information, i.e., 
(auditory) digital twin. 

Technical aspects of an artifact can be used to recreate a virtualized version or 
digital simulation of the artifact itself in the so-called virtual prototyping process [90]. 
Similarly, perceptual and cognitive aspects might serve to obtain digital replicas of 
biological systems, also referred to as a bio-digital twin in the field of personalized 
medicine [23]. The real person/machine provides the data that gives shape to the 
virtual one. In the case of humans, the process of quantified self [89] supports the 
modeling of the virtual digital twin, an algorithmic assistant in decision-making. 
Implications of the digital twin paradigm are already envisioned in [40]. They range 
from the continuous monitoring of patient health to the management of the agency 
in a potentially immortal virtual agent. 

In the scientific literature, the most common definition of a digital twin is that of 
a digital replica. However, we want to characterize our idea 
of the auditory digital twin as a psycho-socio-cultural-material objectified 
actor-network with agential participation. As depicted in Fig. 1.4, all digitally objectifiable 
configurations related to listener profile, VE, HW/SW technology, design, 
ethical impact, etc. are made available to the digital twin so that it can actively 
participate by intra-acting with system states. 

To understand the central role of the digital twin in SIVE, we provide some 
practical examples: 


• Links to setup configurations: Body movement tracking opens up numerous 
opportunities for dynamic rendering and customization of the listener’s acoustic 
contribution in harmony between the real and the virtual body, i.e., the avatar’s 
body. Real-time monitoring of the motion sensors is crucial to avoid a negative 
impact on responsiveness. 

• Links to listener configurations: Adaptation and accommodation processes are 
strongly situated in the task. Assuming the unavailability of individual HRTF measurements, 
the best HRTF model requires a dynamic analysis of each task/context 
in a mutual learning perspective between the listener and the digital twin. 

• Links to environment configurations: The persuasiveness of a VE in inducing a 
behavioral change in a listener depends on social and cultural resonances within the listener. The 
distribution of agency in a music-induced mood has to be analyzed with particular 
attention. Again, certain immersive gaming experiences or role-playing may be 
beneficial for some listeners but should be avoided for others. 

• Links to configurations of others: Other entities, e.g., virtual agents or avatars 
guided by other listeners, populate VEs. To manage confrontation and sharing 
activities, the intra-action between a larger number of digital twins must be consciously 
encouraged. 


All these configurations are not independent but are always interconnected with 
each other. Of particular relevance here, we can consider the externalization of sound 
sources. The level of externalization depends on customization techniques of the 
spatial audio rendering, the acoustic information of the virtual room, the sensory 
coherence and synchronicity, and the familiarity with the situation [8]. A coordination 
action of setup, environment, and listener(s) is needed. The presence in VR 
experiences will be the result of all these fluid intra-connections. 

Fig. 1.4 A schematic representation of the different sound elements needed to create an immersive 
sonic experience. Colored lines identify the differences compared to the scheme proposed in [128]. In 
particular, this representation focuses on the central role of the auditory digital twin as a quantifiable 
locus of agency in an active relationship with all actors of a VR experience. The green arrow identifies 
the participatory relationship between the listener and the digital twin in its performative formation 
of individual self-knowledge 

Suchman posed a highly relevant question in [141]: how can we consider all these 
configurations in such a way that we can act responsibly and productively with and 
through them? To answer, we must deal with the participation issues for all involved 
actors. 

The egocentric perspective requires us to start from the listener and her experience. 
The scientific literature already tells us that memory, comprehension, and human 
performance benefit considerably from these VEs, especially in guided or supervised 
tasks involving human or digital agents [29]. Let us focus on the series of actions 
triggered by an active role of agents. In [31], Collins analyzed the player role in the 
audio design of video games. The participatory nature of video games potentially 
leads to the creation of additional or completely new meanings compared to those 
originally intended by the creators and their storytelling. Hence, there is a change not 
only in the reception but also in the transmission in the communication of auditory 
information. The player becomes a co-transmitter of information introducing non- 
linearities in the experience that propagate throughout the agents’ chain of activity, 
triggering feedback and generating further non-linearities. 

In this respect, Frauenberger’s entanglement HCI (Sect. 1.3.2) suggests abandon- 
ing a user-centered design of the digital artifact in favor of participatory, speculative, 
and agonistic methods with the ultimate goal of obtaining meaningful relationships 
and not merely optimized processes relating to the human or the machine pole, or their 
interaction. It is useful to briefly recall these methods. The agonistic and adversarial 
design employs processes and creates spaces to foster vigorous but polite disputes 
involving designers’ participation in order to constructively identify inspiring ele- 
ments of friction [36]. On the other hand, the participation in a speculative process 
through designing provotypes aims to provoke a discussion about the technological 
and cultural future by considering creative, political, and controversial aspects [117]. 

The more degrees of freedom in the network configurations, the more behaviors 
can potentially be stimulated. The relational network should not be tightly controlled 
because its expressive potential can be exploited through its differentiation. In our 
opinion, the current immersive audio technologies are struggling to emerge, because 
they often introduce static agential cuts, justified by audio quality assessments con- 
ducted in a reductionistic way. On the contrary, the main goal of the digital twin is 
to favor the participation of all available configurations. Specific configurations and 
agential cuts emerge in a speculative, agonistic, and provocative manner so that all 
actors can benefit from different attempts following knowledge diffraction [6]. The 
learning in such fluid and dynamic evolution from one configuration to another is a 
continuous flow of knowledge that informs the digital twin’s activity. In other words, 
the digital twin continuously proposes new agential cuts to record and analyze the 
overall results. A relevant example in SIVE is the co-determination of the attentional 
focus in selecting the meaningful auditory information for a digital twin facing the 
cocktail-party effect [20]. The digital twin must be able to guide an active participation 
with the VE, considering the listener’s available knowledge extracted from previously 
experienced and stored scenarios (and agential cuts). 

The continuous intra-action within the digital twin in relation to a shared and 
immersive experience is of strong practical relevance within the proposed theoreti- 
cal framework. This issue offers concrete possibilities for radically changing the way 
we interact socially in the future, by using digital tools equipped with computational 
intelligence and artificial intelligence (AI) algorithms able to manage complex sys- 
tems [107]. The decision-making phase of intelligent algorithms will improve over 
time, thanks to a dynamic identification and classification of configurations and links 
in the actor-network. The knowledge can be continuously extracted as a result of com- 
putational intra-actions of the human-in-the-loop type where the listener can be seen 
as an agent directly involved in the learning phase, step-by-step influencing cost 
functions and all other measures [69]. More generally, the reinforcement learning 
paradigm focuses on long-term goals, defining a formal framework for the interaction 
between a learning agent and its environment in terms of states, actions, and 
rewards, hence no explicit definition of desired behavior might be required [35]. This 
process can be accomplished during exposure to a continuous stream of multimodal 
information like in the case of lifelong learning [109], or via interactive annotations 
and labeling [81]. 
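
The formulation in terms of states, actions, and rewards can be made concrete with a small sketch. The following toy example uses tabular Q-learning, one standard reinforcement learning algorithm, chosen here only for illustration and not proposed in this chapter. All names are hypothetical: the states are listening contexts, the actions are rendering configurations a digital twin might propose, and the reward stands in for simulated listener feedback with invented mean values.

```python
import random

# Hypothetical states (listening contexts) and actions (rendering configurations).
CONTEXTS = ["quiet_room", "cocktail_party"]
CONFIGS = ["generic_hrtf", "selected_hrtf"]

# Invented mean rewards standing in for listener feedback; not measured data.
MEAN_REWARD = {
    ("quiet_room", "generic_hrtf"): 0.8,
    ("quiet_room", "selected_hrtf"): 0.5,
    ("cocktail_party", "generic_hrtf"): 0.2,
    ("cocktail_party", "selected_hrtf"): 0.8,
}


def step(rng, context, config):
    """Simulated environment: noisy reward and a randomly drawn next context."""
    reward = MEAN_REWARD[(context, config)] + rng.gauss(0.0, 0.05)
    next_context = rng.choice(CONTEXTS)
    return reward, next_context


def train(steps=5000, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in CONTEXTS for a in CONFIGS}
    state = rng.choice(CONTEXTS)
    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit current estimates.
        if rng.random() < epsilon:
            action = rng.choice(CONFIGS)
        else:
            action = max(CONFIGS, key=lambda a: q[(state, a)])
        reward, next_state = step(rng, state, action)
        # Standard Q-learning update toward the bootstrapped target.
        best_next = max(q[(next_state, a)] for a in CONFIGS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q


def greedy_policy(q):
    """Configuration with the highest learned value in each context."""
    return {s: max(CONFIGS, key=lambda a: q[(s, a)]) for s in CONTEXTS}


if __name__ == "__main__":
    print(greedy_policy(train()))
```

No desired behavior is specified anywhere in the code: the context-dependent choice of configuration emerges only from the reward signal, which is the point made in the paragraph above. A human-in-the-loop variant would replace the simulated `step` function with feedback collected from an actual listener.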


1.4 A Taxonomy for SIVE 


An important contribution to design in VEs comes from practice, e.g., professional 
reports and testimonials, best practices, or reviews and interpretations of lessons 
learned in the industry (see Chap. 6 and [76]). Taking into account all these inputs, 
academic studies, new technologies, and commercial user feedback, different com- 
munities draw support for their specific users and domains of interest. Within the 
SIVE field, there is still much work to be done. There is a lack of recommendations 
and design analysis on creating interfaces, interactions, and environments that fully 
exploit egocentric sonic information. To unlock such potential, our suggestion is to 
start from multi- and interdisciplinary work, resulting in these foundational questions: 
does a development path exist for the SIVE field? Is an ad-hoc theoretical 
approach necessary? Without going into the details of the epistemological crisis that 
is affecting the HCI field, we try to avoid discussions on what is called in 
the HCI community intermediate knowledge [72], where positivist and constructivist 
perspectives are constantly clashing [45]. Examples of intermediate knowledge are 
all patterns/best practices proposed for certain aspects of the immersive experience. 

There exist several classifications attempting to describe virtual spaces for sound 
and music purposes. The recent formulation in [4] distinguished three aspects: 


• Immersive audio: the VE should provide the feeling of being surrounded by a 
world of sounds. 

• Interactive audio: the VE allows the user to influence the virtual world in some 
meaningful way. 

• Virtual audio: the virtual world must be dynamically simulated. 


These aspects have already been extensively discussed in the previous sections, and many of 
the existing taxonomies for VR [95, 134, 157] prioritize the system (or simulation) 
or the user rather than the close relationship with the listener. In this section, we propose an 
audio-centered taxonomy that does not distinguish between user and system, lis- 
tener and simulation. Our theoretical framework uses an egocentric audio perspective 
by emphasizing the situated, embodied, enactive dimensions of the listener’s experi- 
ences with their different actors involved. An emphasis on the entanglement between 
humans and technology assumes that the listener’s internal states are directly inac- 
cessible to a non-intrusive and external technology, i.e., focused on exteroceptive 
sense [134]. Accordingly, we will motivate the selection of three dimensions able 
to describe a technological mediation in VR: immersion, coherence, and entanglement. 
The qualitative description in this section leaves as a future challenge a 
quantification of the performative processes introduced here. 

Referring to the autobiographical element introduced in the book preface, the first 
meeting of the two chapter authors at ACM CHItaly 2011, the biennial conference 
of the Italian HCI community, also has a scientific meaning for the proposed 
taxonomy. The paper by Geronazzo et al. [50] was presented more than 10 years 
ago, as one of the first tasks of the first author’s doctoral program. He attempted to 
adapt the virtuality continuum of Milgram and Kishino [95] in the context of spatial 
audio personalization technologies for VR/AR. His main motivation was to over- 
come his difficulty in fitting the strong acoustic relationship (i.e., HRTF customiza- 
tion) between listener and technology into a taxonomy created for visual displays in 
1994. 

That paper proposes a characterization that uses a simplified two-dimensional 
parameter space defined in terms of the degree of immersion (DI) and coordinate 
system deviation (CSD) from the physical world. It is a simplification of Milgram’s 
three-dimensional space, summarized as follows: 


Extent of World Knowledge (EWK): knowledge held by the system about virtual 
and physical worlds. 

Reproduction Fidelity (RF)—virtual object rendering: quality of the stimuli pre- 
sented by the system, in terms of multimodal congruency with their real counter- 
part. 

Extent of Presence Metaphor (EPM)—subject sensations: this dimension takes 
into account the observer’s sense of presence. 


CSD matches EWK with the distinction that a low CSD means a high EWK: the 
system knows everything about the material world and can render the synthetic 
environment in a unified mixed world. From an ecological perspective, the system 
knows and dynamically fosters the overlap between real and virtual. On the other 
hand, EPM and RF are not entirely orthogonal and the definition of DI follows this 
idea: when a listener is surrounded by a real sound, all his/her body interacts with the 
acoustic waves propagating in the environment, i.e., a technology with high presence 
can monitor the whole listener’s embodiment and actions (high DI). 
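
The two-dimensional parameter space and its relation to EWK can be sketched as a small data structure. This is an illustrative assumption rather than part of the original characterization: the text describes DI and CSD qualitatively, so the normalization to [0, 1], the class name, and the linear inverse mapping from CSD to EWK are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSpacePoint:
    """A system placed in the simplified (DI, CSD) parameter space.

    Both values are normalized to [0, 1]; this scale is an assumption made
    for illustration, since the dimensions are only described qualitatively.
    """

    degree_of_immersion: float          # DI: higher means more of the body is tracked
    coordinate_system_deviation: float  # CSD: 0 means aligned with the physical world

    def __post_init__(self):
        # Reject values outside the assumed normalized range.
        for name in ("degree_of_immersion", "coordinate_system_deviation"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")

    @property
    def extent_of_world_knowledge(self) -> float:
        # Hypothetical linear inverse: the text states only that a low CSD
        # corresponds to a high EWK, not the functional form.
        return 1.0 - self.coordinate_system_deviation


if __name__ == "__main__":
    mixed_reality_setup = ParameterSpacePoint(
        degree_of_immersion=0.9, coordinate_system_deviation=0.25
    )
    print(mixed_reality_setup.extent_of_world_knowledge)
```

Deriving EWK from CSD instead of storing it separately mirrors the statement that the two dimensions coincide up to their direction.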

Recently, Skarbez et al. [134] proposed a revised version of Milgram’s virtuality 
continuum introducing two distinctive elements. First, they consider 
only two of Milgram’s three dimensions, similarly to [50]: Immersion and 
Extent of World Knowledge. In particular, Immersion is based on exactly the same 
idea as DI. Second, they introduced a discontinuity in the RF and EPM dimensions 
considering the absence of any display at the left side of the spectrum: the physical 
world without mediation is inherently different from the highest level of realism 
achievable through VR technologies that stimulate exteroceptive senses (i.e., sight, 
hearing, touch, smell, and taste). The latter consideration propagates to Immersion. 

The rough taxonomy of Geronazzo et al. [50] missed the idea of coherence between 
simulation and human behavior, which is well identified as the third analytical dimension 
of Skarbez et al. [134]: coherence. It takes into account both plausibility and 
expectation of technological behaviors for the user in cognitive, social, and cultural 
terms. However, the three proposed dimensions cannot and do not claim to describe 
such a relationship between the user and the system as emphasized by the authors in 
their system-centered taxonomy. The work of Skarbez and colleagues is once again 
anchored to the distinction between user and system which generates several issues 
in framing the intra-actions of actors/factors in VR/AR sonic experiences. 

To support the SIVE theoretical framework, we focus on VR only. This 
means that our discussion will not consider the CSD/EWK dimension assuming that 
there are no anchors to the physical world. However, since we are emphasizing the 
influence of human-real-world relationships on experience in VE and vice versa, we 
have decided not to make the world configurations explicit thus considering them 
as a whole with the listener. Extensions to mixed reality will be the subject of future 
studies in a revised version of our theoretical framework. 

Starting from the previously identified dimensions of Immersion [50] and Coher- 
ence [134], we suggest three top-level categories that need to be addressed through 
interdisciplinary design work. A schematic representation can be found in Fig. 1.5. 

Immersion: the digital information related to the listener-digital twin relationship 
supporting an increasing number of actions in VEs. It measures the technological 
level and its enactive potential between listener and auditory digital twin. 

Coherence: the digital information related to the digital-twin-VE relationship 
that allows the plausible rendering of an increasing number of behaviors in VEs. It 
measures the effectiveness of sonic interaction design in VEs. 

Entanglement: represents the overall effectiveness of the actor-network and its 
agential cuts that are dynamically, individually, and adaptively created. It measures 
participation in the locus of agency and its consequent phenomenological description. 
The auditory digital twin actively proposes new relations favoring redefinitions in the 
agential cuts, i.e., the mutual transformative actions between listener and technology. 

To support our proposed taxonomy for SIVE, we introduce a case study on a 
fictitious and purely theoretical artifact along the lines of Flow [45]. It allows us to 
illustrate the various facets of the framework in a flexible example. 

Spritz! is an interactive and immersive VR simulation supported by full-body 
tracking, stereoscopic vision, and headphone auralization. It is designed to address the 
cocktail-party effect. Human selective attention requires different contributions 
and levels of perception in supporting the ability to segregate signals, also referred 
to as auditory signal analysis [15, 20]. When confronted with multiple simultaneous 
stimuli (speech or non-linguistic stimuli), it is necessary to segregate relevant auditory 
information from concurrent background sounds and to focus the attention on the 
source of interest. This action is related to the principles of auditory scene analysis that 
require a stream of auditory information filtered and grouped into many perceptually 
distinct and coherent auditory objects. In multi-talker situations, auditory object 
formation and selection together with attentional allocation contribute to defining a 
model of cocktail-party listening [75, 132]. The design of Spritz! aims to give shape 
to an auditory digital twin able to detect listener intent, i.e., identify the relevance 
of a sound compared to other overlapping events. It can instantaneously determine 
the attentional balance within an auditory space. Its main goal is to promote the 
listener’s well-being through manipulations of the sound scene in a participatory 
way respecting the listener’s desires. 


1.4.1 Immersion 


According to Murray [98], the term immersion comes from the physical experi- 
ence of being immersed in water. In a psychologically immersive experience, one 
aims at experiencing the feeling of being surrounded by a medium that is a reality 
other than the physical one, able to capture our attention and all our senses. There- 
fore, it has an important element of continuity with our framework by identifying a 
mediating action of VR experiences. According to Slater and Wilbur [137], the term 
immersion is tightly linked to the technology, the mediator, to elicit the sense of 
presence. Technological systems for immersive VR count several combinations of 
equipment and techniques, such as HMD, multimodal feedback, high frame rates, 
and large tracking areas. Such a heterogeneous arsenal is a complex system of func- 
tional elements that have an immediate impact on the listener’s experience. Initially, 
technical specifications were reasonably identified as the main constraints for a VR 
experience. However, other elements were considered with a large-scale diffusion 
of VR technologies. The design of VEs became critical in those details that ensure 
a plurality of actions with virtual objects, the surrounding virtual world, and their 
representations. As discussed in [30], the effects of all these components are highly 
interconnected with each other. Moreover, the absence or misuse of any of them can 
produce immediate disruptions in the sense of presence or cybersickness [33], such 
as low headset quality [16] or unfiltered noise caused by sound sources external to 
the VR setup [136]. 

The strong connection between immersion and equipment means that different 
VR solutions hold an intrinsic level of immersion regardless of the actual applications 
performed with them [120]. This is evident when considering basic audio quality vs. 
quality of the listening experience. For instance, considering projected screens offers 
designers of VEs the opportunity to combine real and virtual elements in the tracked 
area (Chap. 13 offers an interesting reflection on artistic performances mediated 
by VR/AR technologies). However, the overall sense of presence experienced by 
the listener depends on the specific combination of the HW/SW setup. Such setups 
support a certain type of action within the VE. The Immersion “I” dimension takes 
into account these features as the starting point of an enactive potential for the auditory 
digital twin. Such a potential intrinsically limits the development and creation of new 
actions. 

Furthermore, the enactive egocentric perspective of Sect. 1.3.1 provides a solid 
theoretical framework for considering the importance of ecologically valid auditory 
information in eliciting a sense of presence in a VR-mediated experience. First of all, 
it should be mentioned that there is a lack of research related to the effects of inter- 
active sound on the sense of body ownership and agency (refer to the discussion in 
Chap. 2). The vast majority of studies addressing presence from an auditory perspec- 
tive focus on place illusion and spatial attributes. This should not come as a surprise, 
since many of these binaural attributes are perceived by applying sensory-motor 
contingencies and embodied multisensory integrations. A simple example in spatial 
audio technologies is the importance of head movements data that are acquired by 
three degree-of-freedom head-trackers, allowing listeners to exploit binaural cues for 
resolving the so-called front-back confusion [22]. However, computational models 
for binaural cues are usually parameterized by the head radius or circumference, or 
ears position [52]. This example suggests that synchronization and plausible interac- 
tive variations, i.e., occurring in reaction to the digital twin’s gestures in coherence 
with sensorimotor contingencies, can positively influence the sense of agency. In 
addition, other studies demonstrate how the sound of action and an active explo- 
ration can support haptic sensations and vice versa in a co-located and simultaneous 
manner. For instance, Chap. 12 analyzes the impact of sound in an audio-tactile 
identification of everyday materials from a bouncing ball. 
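
The role of head movements in resolving front-back confusion can be illustrated with a minimal sketch. It assumes the classic Woodworth rigid-sphere approximation of the interaural time difference (ITD), parameterized only by the head radius; the function names, the default radius of 0.0875 m, and the speed of sound are illustrative assumptions and are not taken from [22] or [52]. 

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 degrees C


def relative_azimuth(source_az_deg, head_yaw_deg):
    """Source azimuth seen from the rotated head (0 = front, 90 = right)."""
    return (source_az_deg - head_yaw_deg) % 360.0


def woodworth_itd(azimuth_deg, head_radius=0.0875):
    """ITD in seconds for a far-field source and a rigid spherical head.

    Positive values mean the sound reaches the right ear first.
    head_radius (m) is a hypothetical average value, not a measured one.
    """
    az = math.radians(azimuth_deg)
    # Fold the azimuth onto the lateral angle in [-pi/2, pi/2]: sources
    # mirrored about the interaural axis share the same lateral angle.
    lateral = math.asin(math.sin(az))
    return (head_radius / SPEED_OF_SOUND) * (lateral + math.sin(lateral))


# A source at 30 deg (front) and one at 150 deg (back) give the same ITD
# for a static head; a 10 deg head turn to the right separates them.
for yaw in (0.0, 10.0):
    front = woodworth_itd(relative_azimuth(30.0, yaw))
    back = woodworth_itd(relative_azimuth(150.0, yaw))
    print(f"yaw={yaw:4.1f} deg  front={front * 1e6:6.1f} us  back={back * 1e6:6.1f} us")
```

With a static head the two positions are indistinguishable from the ITD alone. After the head turn, the ITD decreases for the frontal source and increases for the rear one, which is the kind of sensorimotor contingency a three degree-of-freedom head-tracker makes available to the rendering. 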

Regarding spatial hearing, there is a huge differentiation in accuracy between 
more (experienced) and less (naive) reliable listeners [3, 51]. More generally, the 
distinction between categories of listeners is still challenging and is made based 
on several factors such as multisensory calibration and integration (see Chap. 12 
for audio-haptics), familiarity with immersive/spatial audio technologies, musical 
background [152], or audio mixing experience (as in Chap. 9), etc. Both acoustic 
(i.e., acoustic transformations of the body) and non-acoustic (i.e., everything else) 
factors are highly individual and depend on the relationship between the listener and 
the real world, which is mediated by technology in a general sense (not only in the 
digital domain, e.g., games, musical instruments, etc.). 

All objectifiable information regarding the listener constitutes the known configurations. 
For example, bottom-up approaches for modeling psychophysical phenomena of 
spatial hearing and multisensory integration fall into this category. Such knowledge 
has to be integrated into the immersive system, explicitly contributing to the actor- 
network managed by the digital twin. 

Coming back to our Spritz! simulation, the level of I is expected to be high due 
to state-of-the-art technological components. The digital twin can recognize and 
manage several full-body skeletal configurations as well as near-field acoustics algo- 
rithms that take into account the acoustic coupling of the main joints such as the head 
and shoulders. This last aspect is usually largely underestimated in virtual acoustics 
systems [17]. The customization based on anthropometry allows the digital twin to 
guide the acoustic rendering of movements considering head tilt and torso shadow- 
ing in real-time. Furthermore, binaural and spectral cues might be personalized and 
weighted according to the listener’s level of uncertainty, allowing the digital twin to 
predict which sound sources are most likely to be segregated based on an egocentric 
direction-of-arrival perspective. 

The contribution of the I dimension can be summarized as follows: I is the digital 
information related to the listener-digital twin relationship limited by a specific 
technological setup. The support of an increasing number of actions in VEs is a 
consequence of technological improvements (both HW/SW) and/or an increasing 
objectification of the listener’s configurations. Considering the idea of immersive 
potential of Chap. 11, limitations in enaction determine which changes are signifi- 
cant after technological manipulation. The level of reconfigurability within the digital 
twin accounts for the constant dialogue with the listener to explore her state and ten- 
dency to immersion in every moment of the experience (see also Sect. 1.4.3). 


1.4.2 Coherence 


The VR simulation must be able to make the digital twin freely interact with the 
VE, eliciting a plausible experience for the listener who is always aware of the 
mediated nature of the experience. In other words, the interaction design must support 
functional and plausible actions, the ‘doing’ in [43]. This means that possible 
configurations of the technical setup and the listener (the objectification in the digital- 
twin, see Sect. 1.4.1) constitute the enactive potential of immersion and must be 
balanced within the sonic interactions. 

In this section, we focus on the coherence of the digital-twin/environment rela- 
tionship. On the other hand, Sect. 1.4.3 provides an interpretation of the dialogue 
with the immersion dimension. 

VE simulations can create fictional worlds, exploiting opportunities for both nat- 
uralistic and magical interactions [13]. Designers can experiment with defining rules 
that only apply in the virtual domain, such as scale, perspective, and time. The 
philosophical discussion of the dualism “Doing vs. Being” in [4] provides interesting 
insights into our egocentric auditory perspective: simulation can have different lev- 
els of interactivity suggesting different action spaces for the digital twin in the virtual 
worlds. 

Interacting with the VE, avatar included, consists of altering the states of 3D 
elements that have been created at different levels of proximity: the virtual body 
(i.e., avatar), the foreground (i.e., peripersonal object manipulation space), and 
the background (i.e., extra-personal virtual world space). Existing research on 3D 
interaction focuses on the spatial aspects of the following main categories: selec- 
tion, manipulation, navigation, and application control (the latter involving menus 
and other VE configuration widgets). Selection techniques allow users to indicate an 
object or a group of objects. According to the classification of Bowman et al. [14], 
one can consider selection techniques based on object indication (occlusion, object 
touch, pointing, indirect selection), activation method (event, gesture, voice command), 
and feedback type (text, acoustic, visual, force/tactile). Manipulation techniques 
allow the digital twin to modify all virtual object configurations that are 
made accessible to it: e.g., the spatial transformation of objects, i.e., roto-translation 
and scaling, surface properties such as material texture and acoustic properties, or 
3D shape and structure manipulations. For the variety of interaction metaphors for 
selection, we refer to a recent review in [92]. Finally, navigation techniques allow 
digital twins to move within the VE to explore areas and virtual worlds. Typical 
movements include walking and virtual transportation, including flight experiences. 
In particular, walking is fundamental to humans, and supporting natural locomo- 
tion is not always feasible on a limited tracked space. Accordingly, there are other 
interaction metaphors such as walk-in-place [42], teleportation, or semi-automatic 
movements between control points [61]. It is worthwhile to mention the self-motion 
illusions. In circular vection [116], moving sounds surrounding the listeners facili- 
tate the perception of being in motion when in fact they are not. For spatial design 
considerations in sonic interactions, Chap. 6 provides a comprehensive analysis and 
a typology of VR interactive audio systems. 

These configurations must be plausible and the digital twin should support a 
dynamic transition from one to another. This is crucial to avoid irreparable breaks 
in presence. Therefore, coherence “C” describes the degrees of freedom introduced 
by the sonic interaction design in VEs based on the active dialogue between the 
digital-twin and the VE, established experience after experience. 

In this section, we are particularly interested in the plausibility illusion determined 
by the overall credibility of a VE concerning subjective expectations. It is not only a 
coherence between external events not directly caused by listeners but an objective 
feature of the VE [134]. Its reconfigurability includes an internal logical coherence 
and a behavioral consistency considering prior knowledge. Sound conveys eco- 
logical information relevant to the expectation toward VE behaviors compared to 
the listener’s everyday experience: embodied, and situated in a socio-cultural con- 
text. The environment configurations (avatars and virtual worlds) intertwine with the 
known listener configurations held in the digital twin. Once again, the digital twin 
has a central and active role following an egocentric audio perspective (see Fig. 1.4 
for this foundational idea). Dimension C advocates a top-down approach to interactions, 
constituted of cognitive and socio-cultural influences based on the listener’s real 
life. 

Moreover, coherence does not presuppose physical realism. It fosters interactions 
in coherent virtual magic worlds. The dynamic dialogue between VE and digital- 
twin makes it possible. For example, let’s consider a cartoon world where simplified 
descriptions of sound phenomena exaggerate certain features [118]. It may be 
plausible as long as it conveys relevant ecological information. Audio procedural 
models are based on simplifications in the properties and behavior of the corresponding 
real object, i.e., simplified configurations. Such parameterization can be informed by 
auditory perception and cognition maintaining ecological validity of a fictional sonic 
world while reinforcing the listener’s sense of agency. Digital information regarding 
the relationship between the digital twin and VE allows the creation of an increasing 
number of plausible behaviors in VR. 

Considering once again the distinction among avatar, peri- and extra-personal 
spaces, neurophysiological research on body ownership and multisensory integration 
suggests the existence of a fluid boundary in the perceived space by subjects [60]. 
It is worth noticing that the neuronal activity sensitive to the appearance of stimuli 
within the personal space is multisensory in nature and involves neurons located in 
the frontoparietal area. In this area, neuronal activity is related to action preplanning 
particularly for reacting to potential threats [130] and elicits defensive movements 
when stimulated [32]; these multimodal neurons combine somatosensory with body 
position information [58]. Bufacchi and Iannetti [24] suggested that the personal 
space should be described as a series of action fields that spatially and dynamically 
define possible responses and create contact-prediction functions with objects. Such 
fields may vary in location and size, depending on the body interaction within the 
environment and its actual and predicted location. Space is also modulated in response 
to external stimuli and internal states of the subject, defining a relationship between 
listener, environment, and tools [119]. 

Of particular interest for our framework are modulations due to proxemics. 
The term was introduced by Hall [63] and concerns implicit social rules of interper- 
sonal distance among people conveying different social meanings. The cooperation 
in a socially shared interpersonal space [144] requires supporting the transition from 
individual to collaborative spaces [142]. In Chap. 8, the design of sound intensity (or 
sound attenuation) as a function of the proximity from a sound source is addressed. 
Different configurations of personal and public spaces were tested in a shared VE 
for collaborative music composition. Interestingly, rigid boundaries in the transition 
between spaces forced listeners to take a social distance and isolate with a nega- 
tive impact on the collaborative aspects of the composition process. Therefore, the 
separation between public and personal space should be fluid rather than rigid. The 
VE should be configurable in the social aspects that emerge from the strong inter- 
connection between configurations made available to the digital twin, increasing the 
fluidity and better supporting collaboration in shared experiences. 
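
The contrast between rigid and fluid boundaries can be sketched as two gain laws over the distance from a sound source. This is only an illustration under stated assumptions: the boundary radii, the function names, and the smoothstep fade are hypothetical choices and do not reproduce the attenuation designs evaluated in Chap. 8. 

```python
def smoothstep(x):
    """Cubic ease between 0 and 1, with the input clamped to [0, 1]."""
    x = min(1.0, max(0.0, x))
    return x * x * (3.0 - 2.0 * x)


def rigid_gain(distance, boundary=2.0):
    """Rigid boundary: a source is either fully audible or muted."""
    return 1.0 if distance <= boundary else 0.0


def fluid_gain(distance, inner=1.5, outer=3.0):
    """Fluid boundary: full level inside `inner`, smooth fade up to `outer`.

    The radii (in meters) are hypothetical values chosen for illustration.
    """
    t = (distance - inner) / (outer - inner)
    return 1.0 - smoothstep(t)


# Walking away from a collaborator's sound source, half a meter at a time.
for step in range(8):
    d = 0.5 * step
    print(f"d={d:3.1f} m  rigid={rigid_gain(d):.2f}  fluid={fluid_gain(d):.2f}")
```

With the rigid law, a small step across the boundary removes the other participant's sound entirely. With the fluid law the level changes gradually, so the listener can remain partially aware of shared activity while moving between personal and public space. 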

In Spritz!, we should identify the VE’s abilities in shaping the simulation within the 
digital twin. First, Spritz! has multiple configurations accounting for different 
strategies for the level of audio detail. The radial distance with an egocentric reference can 
drive the dynamic definition of three partially overlapping levels of detail associated 
with proximity profiles: avatar, personal and public. The avatar’s movement sounds 
are rendered through procedural approaches with individualized configurations based 
on listener acoustics; in the personal space, Spritz! can manipulate sound behavior 
with simplified models taking into account security and privacy levels required by 
the situated and embodied states of the digital twin. Finally, sounds in the public 
space can be clustered, grouped, or attenuated by implementing plausible statistical 
behavior, e.g., using audio impostor replacement such as audio samples. 
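
Since Spritz! is a fictitious artifact, the following is only a sketch of how a radial distance could drive three partially overlapping levels of detail. The crossfade radii and the function names are hypothetical; the weights indicate how much of each rendering strategy (individualized procedural, simplified model, statistical impostor) would contribute at a given egocentric distance. 

```python
def _ramp(x, lo, hi):
    """Linear ramp from 0 at `lo` to 1 at `hi`, clamped outside the range."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def lod_weights(radius, avatar_fade=(0.2, 0.5), public_fade=(1.2, 2.0)):
    """Blend weights of the avatar, personal, and public proximity profiles.

    radius: egocentric distance of the sound source in meters.
    avatar_fade / public_fade: hypothetical overlap zones (start, end) in
    which two adjacent profiles are active at the same time.
    """
    if not (avatar_fade[0] < avatar_fade[1] <= public_fade[0] < public_fade[1]):
        raise ValueError("fade zones must be ordered and must not intersect")
    to_personal = _ramp(radius, *avatar_fade)
    to_public = _ramp(radius, *public_fade)
    return {
        "avatar": 1.0 - to_personal,         # individualized procedural models
        "personal": to_personal - to_public,  # simplified sound models
        "public": to_public,                  # clustered or impostor sources
    }


for r in (0.1, 0.35, 1.0, 1.6, 3.0):
    print(r, lod_weights(r))
```

The three weights always sum to one, and only two adjacent profiles are blended at any distance, so a source moving through the scene changes its rendering strategy without an audible switch. 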

The Spritz! environment should facilitate resolutions of the cocktail-party prob- 
lem in crowded situations. Accordingly, it should be able to apply noise suppression 
of negligible information in the public space or vice versa to operate audio enhance- 
ments supporting attentive focus. This dynamic connection between VE and digital 
twin should be able to maintain coherence in the induced behaviors, supporting the 
plausibility of actions while bending the space around the listener. 

A meaningful manipulation of virtual spaces is crucial and creative. Since SIVE 
naturally includes research in music composition, a VE must foster the 
development of individual or collaborative creative ideas through dynamic control of its 
configurations within and by the digital twin. In particular, results in Chap. 8 sup- 
port VE spatial design as the creation of “magical” exploratory opportunities, adding 
original dynamics to collaborative work in VEs. The digital twin has a pivotal role 
in such space modulations that allow tracing boundaries performatively and eliciting 
internal emotional states following the listener/composer’s expectations. 


1.4.3 Entanglement 


The listener’s susceptibility to immersive VE experiences is usually determined by 
administering questionnaires [155, 156]. The experimenters’ aim is usually to per- 
form a screening test to distinguish who can and will be able to easily immerse in a 
VR-mediated situation. Furthermore, this separation is assumed to remain constant 
throughout a short-enough experiment. However, the immersive tendency can change 
over time due to training, learning, experience, mood changes and personality, etc. 
(see Chap. 11 for further details). For such reasons, common recommendations for 
VR experiments suggest conducting single experimental sessions. However, 
studying the impact of the aforementioned dynamic changes opens up the third and last 
dimension of our taxonomy: entanglement, which is the knowledge extraction from 
the evolution of an actor-network able to reveal multiple facets of the egocentric 
experience in time, space, and intra-actions. 

The first step requires describing the available configurations. Starting from the 
idea of immersive tendency, VR simulation would benefit from the knowledge of 
the listener’s susceptibility toward configurations of setup and environment to mod- 
ify or avoid non-significant experiences, e.g., getting a break in illusion. In other 
words, (quantifiable) listener configurations must be defined, discovered, and actively 
explored by the digital twin. For example, the way sound samples are engineered 
is very interesting here. A sliding friction sample, e.g., squeaking, rubbing, etc., 
requires a large amount of data and randomization techniques to avoid repetition. 
Sounds should be consistent with the listener’s expectations in response to com- 
plex and continuous motor actions. For this reason, procedural audio approaches can 
tightly connect the sound to complex and continuous motor actions. 

The Entanglement dimension (“E”) aims to provide a phenomenological 
characterization of actors’ evolution and activities based on their performativity and 
participation in a locus of agency. We realize the high complexity of such a descriptive 
and formal process, but we believe that an attempt at capturing the transformative 
potential of VR in mediated experiences is worth making for the SIVE 
discipline. Of great importance here is the idea of monad by sociologists Tarde and 
Latour [84, 143]: “A monad is not a part of a whole, but a point of view on all the 
entities taken severally and not as a totality”. One can consider a monad as a 
relational perspective of each actor, shifting the emphasis from aggregation of the whole 
to movement between different points of view. The main purpose of any perspective 
is the structural analysis of the network and its configurations and, at a later stage, 
to derive knowledge and understanding of its dynamics. The inherently egocentric 
local perspective of the locus of agency, i.e., digital twin, is again emphasized as 
opposed to a global view. Egocentric networks built around specific nodes such as 
the listener configurations can support the exploration of intra-activated dynamics. 
Configurations and links can be discovered and/or modified during different mediated 
experiences. 

The collaboration among actors is vital in integrating different points of view, 
creating opportunities for meaningful experiences. In shared VEs (Chap. 8), listen- 
ers are co-present with other human participants interacting in an interpersonal way. 
Research on computer-supported cooperative work can provide 
interesting insights into prioritizing collaboration [99]. The choice of collaborative models 
fostering the design of active VEs for meaningful and creative experiences is of 
particular relevance to entangled SIVE. 

Intentionality and gesture support can be achieved through continuous network 
reconfiguration. Identifying common goals through inter-actor communication is a 
fundamental requirement for increasing the digital twin’s enactive potential. We argue 
that this area of research is absolutely new for SIVE, especially in these collabo- 
rative aspects. Many fundamental and critical questions for SIVE are waiting to be 
answered. 

Digital transformation promotes ubiquitous and pervasive interconnected data 
sets with the opportunity to offer new ways of navigating and extracting knowledge. 
Dörk et al. [37] explored the visualization of relational information spaces, 
incorporating both the individual and the whole in a monadic perspective. The authors’ 
goal was to exploit the rich semantic connections to design new exploration meth- 
ods for interconnected elements. There is an increasing interest in more exploratory 
forms of information retrieval without specific needs/constraints, sustained by the 
desire to learn, play and discover openly [91]. In analogy with these practices, the 
digital twin should curiously move between nodes, configurations, and connections 
experimenting and manipulating the actor-network for sense-making. To encourage 
surprising discoveries and interest within experiences, the digital twin should offer 
unconventional and appealing views with the agency. 

The auditory digital twin actively proposes new relationships and encourages 
agential cuts under the mutual transformative action between listener and technol- 
ogy. In the monadic perspective of the digital twin, the distinctive qualities of each 
actor within a VE should emerge in each situated experience. Differentiation among 
configurations is not an a priori actor property but it is identified by its uniqueness 
in the network. Each actor imprints its particular identity on an ever-changing rela- 
tional world. In other words, the digital twin is looking for differences in each actor 
by considering different monadic perspectives. VR simulations allow us to take the 
point of view of each element thanks to a shared virtual world knowledge. 

In the area of AI agents, i.e., non-human entities capable of interacting with eco- 
logical behaviors [109], intelligent algorithms would have the predictive potential 
on the listener’s action program. Their ability in monitoring and predicting listeners’ 
behavioral responses could enable the digital twin to determine listeners’ expec- 
tations and cognitive and psychological capabilities [25]. Moreover, AI algorithms 
could propose exploration paths to the listener within VEs. Therefore, the capabilities 
of safely navigating through temporary, transient, and overlapping configurations are 
definitely complementary to their predictive power. 

In line with the emerging research area called immersive analytics, humans and 
AI can support each other in decision-making based on navigation in shared thinking 
spaces [133]. Meetings between the listener and her digital twin can take place in 
a virtual meta-environment where configurations and connections of an experience 
can be a posteriori analyzed, collaboratively. The unique personal supervision of the 
AI algorithms implemented in the digital twin could reflect the listener’s traits and 
interests. Understanding the listener’s preferences and assessing their impact on the 
predictive performance of AI algorithms can help to propose adaptive and customiz- 
able systems with a certain level of memory of past VR-mediated experiences [103]. 

Finally, how can we measure the overall effectiveness of an actor-network and its 
agential cuts that the digital twin dynamically, individually, and adaptively creates? 
This question corresponds to Latour et al.’s challenge to take into account long-term 
features, indicative of a systemic order that might be learned navigating overlapping 
perspectives (monads) [84]. Such an emphasis on navigation gives a unique role 
to movement/exploration as a way of experiencing relationships and differences 
between configurations. Therefore, we suggest that the digital twin should navigate 
along with different and novel perspectives for sense-making. The dynamic relational 
quality of each actor’s unique position in network space, i.e., agential cuts, reflects 
the exploration potential shaping and creating meaning for the listener. 

We argue that the VR-mediated experience is never solitary, considering both 
human and non-human actors. Any actor cooperates within a shared VE, e.g., to 
perform a musical performance (Chap. 13) or a spatialized audio mixing (Chap. 9). 
Collaboration takes place on a common task, which has a huge impact on the intra- 
action dynamics. In addition to the exploratory movements, technological trans- 
parency introduced in Sect. 1.3 is a key factor influencing “E” measures. In analogy 
with the sense of presence, co-presence [26], i.e., the feeling of sharing a VE with 
others, has been shown to strongly depend on avatar appearance and its realism, as 
well as on the cooperation level in task completion [111]. Another aspect worth 
mentioning here is awareness [7], i.e., the understanding of other actors’ actions, 
especially with non-human agents. This latter concept strongly relates to trustworthy 
AI issues and explainable AI [70]. 

A further “E” measure in SIVE can be inspired by River and MacTavish’s 
framework [117]. They proposed to generate low-level prototypes of an artifact from 
simplified attributes. The more extreme the change in such attributes, the more likely 
the change will be to provoke and reveal hidden assumptions in the design process. In 
our taxonomy, we call it generative potential in explorative movements and network 
changes, and technological transparency. 

The final example in our fictitious case study Spritz! considers the meaningful 
prediction of the listener’s intentionality and the understanding of any sources of 
interest, e.g., avatar’s gestures or other avatars’ action. Spritz! should be able to 
support attentional focus. A virtual ray/cone pointer projected by the avatar through 
the VE or a virtual cursor/hand mapped to the listener’s body movements might 
facilitate the selection of points of interest. Gesture analysis could provide Spritz! 
with relevant information for semi-automatic focus support. This scenario opens up the 
experimentation and development of “magic” interactions with virtual superhuman 
hearing tools such as a dual audio beamformer guided by the avatar’s body [52]. 
Spritz! should be free to propose novel ways of interaction and exploration within 
VEs. This dynamic dialogue can be considered a form of virtual provotyping that 
has to guarantee coherence with all available sensorimotor contingencies, having a 
positive effect on the listener’s sense of agency in any proposed behavior. 
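
The cone-pointer selection described above can be sketched as a geometric test followed by a per-source gain. This is a deliberately simplified stand-in, not the dual audio beamformer of [52]: the function name, the cone half-angle, and the boost and cut values in dB are hypothetical, and source positions are assumed to be given relative to the listener. 

```python
import math


def cone_focus_gains(pointer_dir, sources, half_angle_deg=15.0,
                     boost_db=6.0, cut_db=-12.0):
    """Return a linear gain per source: boost inside the cone, cut outside.

    pointer_dir: (x, y, z) pointing direction from the listener.
    sources: dict mapping a source name to its (x, y, z) position
             relative to the listener.
    The angle and dB values are illustrative assumptions.
    """
    px, py, pz = pointer_dir
    pointer_norm = math.sqrt(px * px + py * py + pz * pz)
    if pointer_norm == 0.0:
        raise ValueError("pointer direction must be non-zero")
    cos_limit = math.cos(math.radians(half_angle_deg))

    gains = {}
    for name, (x, y, z) in sources.items():
        dist = math.sqrt(x * x + y * y + z * z)
        if dist == 0.0:
            inside = True  # a source at the listener position is always in focus
        else:
            cos_angle = (px * x + py * y + pz * z) / (pointer_norm * dist)
            inside = cos_angle >= cos_limit
        gain_db = boost_db if inside else cut_db
        gains[name] = 10.0 ** (gain_db / 20.0)
    return gains


scene = {
    "talker_ahead": (0.1, 2.0, 0.0),
    "talker_side": (2.0, 0.0, 0.0),
    "music_behind": (0.0, -1.0, 0.0),
}
print(cone_focus_gains((0.0, 1.0, 0.0), scene))
```

In this sketch the selection is fully automatic once the pointing direction is known. A semi-automatic variant could let gesture analysis propose the cone and leave the confirmation to the listener, which keeps the manipulation of the sound scene participatory. 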


1.5 Conclusion 


This chapter aims at emphasizing how the SIVE book was born and developed in a 
constantly evolving situation in the field of human-computer interaction. We invite 
the reader to explore all its chapters with this shared and dynamic tension that we, 
as editors, have tried to formalize in what we have called the egocentric perspective 
of the auditory digital twin. The co-transformation of man and technology seems to 
us a central theme that will surely help us to enter the 4th HCI wave, consciously. 
The proposed taxonomy focuses on action, behavior, and sense-making because 
we believe it is a meaningful way for authentic auditory experiences in VR. In 
particular, the last aspect of sense-making turns out to be the most challenging. 
The idea of diffraction and exploration of differences and discoveries requires novel 
ways of scientific investigation in SIVE. The most crucial aspect might be the level 
of personalization that future technologies will need to acquire from the listeners.
New paradigms for artificial and immersive interaction between humans and VE will 
have to be proposed. The attribution of agency to a digital twin is a network effect 
that will have relevant ethical implications, as well as complexity in its analysis.

How much would the listener trust her digital twin? Its intermediary role, some-
times provocative, in search of differences can elicit strong reactions in the listeners. 
Will the listener accept and share this perspective? The affective information strongly 
links sound to meaning [138], creating empathy between listener and her digital twin. 
This aspect will be carefully considered for its ethical implications. 

How can one quantify and classify the various actor networks in the proposed 
three dimensions? Surely, this is an open challenge of this first proposed theoreti- 
cal framework for SIVE. Visualizing and representing transitions and agential cuts 
are relevant issues toward an objective description of any mediation phenomenon. 
Creating multiple ontologies in “magical” interaction metaphors allows listeners to transcend
reality and immerse themselves in unique experiences within VEs. Since VR is not yet able
to fully replicate natural reality and may not be able to do so, its current features 
actually allow listeners to do and be things that are impossible in the real world. 
This is the very essence of knowledge diffraction: the digital twin should explore 
such differences that are impossible to test in the physical world, extracting mean- 
ing for the listener. Of particular interest here, the ideas of superhuman powers and 
virtual prototyping [52] reflect the human desire to increase one’s capabilities. They are
receiving increasing attention thanks to the post-humanism and human enhancement
manifestos [97]. Following this line of thought, Sadeghian et al. [121] proposed that
VR designers explore new forms of interaction without necessarily imitating the
physical world. VR’s limitations in creating realistic interactions are replaced by a 
focus on experiences that are impossible to have in the real world, such as superhuman
powers of flying, X-ray vision, shape-shifting, super memory, etc. Limits obviously
apply: VEs can depart from reality only up to the point where confusion invades the listener.
Indeed, a balance in ecological and familiar stimulation should guide the creation of
a “safety net” or “comfort zone” for the listener, within which the digital twin can explore
agonistic and provocative knowledge opportunities without drawbacks.

This chapter aims to shape the SIVE research field, sonic interactions in VEs, 
that is now ready to welcome wide-ranging reflections on what might be called
sonic intra-actions in VEs.


Part II
Interactive and Immersive Audio 


Chapter 2
Procedural Modeling of Interactive
Sound Sources in Virtual Reality 


Federico Avanzini 


Abstract This chapter addresses the first building block of sonic interactions in 
virtual environments, i.e., the modeling and synthesis of sound sources. Our main 
focus is on procedural approaches, which still struggle to gain recognition in commercial
applications and in the overall sound design workflow, both of which remain firmly grounded
in the use of samples and event-based logics. Special emphasis is placed on physics-based sound
synthesis methods and their potential for improved interactivity. The chapter starts 
with a discussion of the categories, functions, and affordances of sounds that we listen 
to and interact with in real and virtual environments. We then address perceptual 
and cognitive aspects, with the aim of emphasizing the relevance of sound source 
modeling with respect to the senses of presence and embodiment of a user in a virtual 
environment. Next, procedural approaches are presented and compared to sample- 
based approaches, in terms of models, methods, and computational costs. Finally, 
we analyze the state of the art in current uses of these approaches for Virtual Reality 
applications. 


2.1 Introduction 


Takala and Hahn [86] were possibly the first scholars who proposed a sound rendering 
pipeline, in analogy with the image rendering pipeline, aimed at producing an overall 
“soundtrack” starting from a description of the objects in an audio-visual scene. 
Their pipeline included sound modeling and sound rendering stages, running in 
parallel with the image rendering pipeline. Figure 2.1 proposes an updated picture, 
which considers several aspects investigated by researchers throughout the last three 
decades and may represent a general pipeline for sound simulation in Virtual Reality 
(hereinafter, VR). 

Much of recent and current research is concerned with aspects related to the
“Propagation” and “Rendering” blocks represented in this figure, as well as the
geometrical and material properties of acoustic enclosures in the “Modeling” block.
This chapter focuses instead on the remaining balloon of the “Modeling” block, the
modeling of sound sources.

Fig. 2.1 A general pipeline for sound simulation in Virtual Reality (figure based on [51])

One obvious motivation for looking into sound source modeling is that all sounds 
occurring in a virtual (and in a real) environment originate from some sources, 
before propagating into the environment and finally reaching the listener. Secondly, 
many of the sonic interactions occurring in a virtual environment are interactions
between the subject’s avatar and sound sources. Here, our definition of interactive is 
analogous to the one given by Collins [20] for video-game audio: whereas adaptive 
audio generically refers to audio that reacts appropriately to events and changes 
occurring in the simulation, interactive audio refers to sound events occurring directly 
in reaction to the avatar’s gestures (ranging from pressing a button to walking or hitting
objects in the virtual scene). 

The current dominant paradigm in VR audio, largely based on sound samples¹
triggered by specific events generated by the avatar or the simulation, is minimally 
adaptive and interactive. This is the main motivation for looking into procedural 
approaches to sound generation. 
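
As a minimal sketch of the difference, consider a hypothetical modal impact model. The mode frequencies, decay rates, and the mapping from hit velocity to loudness and brightness below are illustrative placeholders, not values from this chapter. A procedural source of this kind computes its output from the parameters of the triggering gesture, whereas a sample-based source would replay the same recording whatever the gesture:

```python
import math

SAMPLE_RATE = 44100  # Hz; assumed output rate

# Hypothetical modal data for a small struck object:
# (frequency in Hz, decay rate in 1/s, relative gain).
MODES = [(440.0, 6.0, 1.0), (1130.0, 9.0, 0.6), (2210.0, 14.0, 0.35)]


def synthesize_impact(velocity, duration=0.5):
    """Return a list of samples for an impact of normalized velocity in (0, 1].

    Each mode is an exponentially decaying sinusoid. Harder hits are louder
    and excite the higher modes relatively more (an illustrative mapping).
    """
    n = int(duration * SAMPLE_RATE)
    out = [0.0] * n
    for index, (freq, decay, gain) in enumerate(MODES):
        # Higher modes are attenuated more strongly for soft hits.
        mode_gain = gain * velocity ** (1 + index)
        for i in range(n):
            t = i / SAMPLE_RATE
            out[i] += mode_gain * math.exp(-decay * t) * math.sin(2 * math.pi * freq * t)
    return out


soft = synthesize_impact(0.2)  # a light tap
hard = synthesize_impact(0.9)  # a strong hit: louder and brighter
```

The same function yields a different signal for every gesture, which is the sense in which procedural sources can be both adaptive and interactive.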


2.2 What to Model 


The first question that should be asked is as follows: what are the sound sources that 
need to be modeled in a virtual environment, and how can these be organized into 
a coherent and comprehensive taxonomy? Such a taxonomy would provide a useful 
tool to analyze in a systematic way the state of the art of the research in this field and 
possibly to spot research directions that are still under-explored. 


1 For the sake of clarity, in this chapter, we use the term “sample” in its commonly accepted meaning
of pre-recorded/pre-processed sound excerpt, rather than that of a single value of a digital signal.


2.2.1 Diegetic Sounds


One first possible and often used distinction can be borrowed from narrative theory.
The term diegesis has been used in film theory to refer to the fictional world of the 
film story, and correspondingly the adjective diegetic refers to elements that are part 
of the depicted fictional world. By contrast, non-diegetic elements are those which 
should be considered non-existent in the fictional world. 

As far as sound in particular is concerned, three main categories are traditionally 
used in films: speech and dialogue, sound effects, and music [80]. The first two 
categories comprise diegetic sounds, while music is a non-diegetic element having 
mostly an affective and emotional role, a distinction that may be related to the motto 
“Sound effects make it real, music makes you feel” [49]. 

Several taxonomies for sounds in video-games have been proposed and are typ- 
ically based on similar categories [42]. These may be employed in the context of 
VR as well, with the additional caveat that VR applications only partly overlap with 
video-games. In particular, VR, and immersive VR specifically, may be defined as 
“a medium in which people respond with their whole bodies, treating what they 
perceive as real” [77]. In light of this definition, in this chapter, we focus on diegetic 
sounds, those that “make it real”: in other words, those that contribute most to the 
overall sense of presence of a user within a virtual environment, which we will
discuss in Sect. 2.3. 

An interesting example of a taxonomy for sound in games is provided by Stock- 
burger [84], who considers five different types of sound objects. Non-diegetic ele- 
ments include (i) music, but also (ii) interface sounds, which may sometimes be 
included into the diegetic part of the game environment; proper diegetic elements 
instead comprise the three categories of (iii) speech and dialogue, (iv) ambience (or 
“zone” sounds in Stockburger’s definition), and (v) effects. 

Speech and dialogue are very relevant components of a virtual environment; how- 
ever, our focus in this chapter is on non-verbal sound. The distinction between ambi- 
ence and effect sounds is mainly a perspectival one: the former are background 
sounds, connected to locations or zones (understood both as different spatial loca- 
tions in an environment and different levels in a game) and having distinct auditory 
qualities; the latter are instead foreground sounds other than speech, that are cog- 
nitively linked to objects or events, and are therefore perceived as being produced 
by such objects and events. Sound-producing objects may be moving or static ele- 
ments, may be directly interactable by the avatar or just synchronized to the visual 
simulation, or may be even outside the visual field of view. 

Stockburger [84] proceeds in distinguishing effect subcategories, depending on 
the elements of the environment they are linked to. His classification is heavily 
tailored to games, but serves as an inspiration to further inspect and subdivide effect 
sounds. For the purpose of the present discussion, we only make a distinction between 
two subcategories: (i) effects linked to the avatar, and (ii) all remaining effects in 
the environment. Effects linked to the avatar are related to sounds produced by the 
avatar’s movement or object manipulation: footsteps, swishing of an object cutting
through the air, knocking on a wall, clothes, etc. They can also include sounds
produced by the avatar’s own body, such as breathing or scratching. The remaining
effects in the environment may include non-verbal human sounds, sounds produced
by human activities, machine sounds, and so on. A visual summary is provided in
Fig. 2.2. The categories and subcategories identified here can be usefully mapped
into interactive and adaptive sound sources.

Fig. 2.2 [...] interactivity of diegetic sounds in a virtual environment


2.2.2 Everyday Sounds 


An orthogonal approach with respect to the previous one amounts to characterizing 
sound sources in terms of the physical mechanisms and events that are associated with
those sources. 

Typical lists of audio assets for games or VR include, at the second level of clas- 
sification (after the branch between ambience and sound effects), such categories as 
footsteps, doors, wind and weather, and cars and engines, with varying degrees of 
detail. These categories in fact refer to objects and events that are physically respon- 
sible for the corresponding sounds; however, such classifications follow common 
practices rather than a standardized taxonomy. A more systematic categorization can 
be found in the classic works by Gaver [33, 34], who proposed an “ecological” cat- 
egorization of everyday sounds (the ecological approach to auditory perception will 
be discussed in more detail in Sect. 2.3.2). Gaver derived a tentative map of everyday 
sounds, which is shown in Fig. 2.3 and discussed in the remainder of this section. 

At the highest level, Gaver’s taxonomy considers three broad classes of sounds: 
those involving vibrating solids, liquids, and aerodynamics in sound generation, 
respectively. Sounds generated by solid objects have patterns of vibrations structured 
by a number of physical attributes: those of the interaction that has produced the 
vibration, those of the material of the vibrating objects, and those of the geometry and 
configuration of the objects. Sounds involving liquids (e.g., dripping and splashing) 
also depend on an initial deformation that is counter-acted by restoring forces in
the material, but no audible sound is produced by the vibrations of the liquid itself;
instead, the resulting sounds are created by the resonant cavities (bubbles) that form
and oscillate in the liquid. Aerodynamic sounds are caused by the direct modification
of atmospheric pressure by some source, as with an exploding balloon or the noise
of a fan; they also include events in which such changes in pressure transmit energy
to objects and set them into vibration (e.g., when wind passes through a wire).

Fig. 2.3 A taxonomy of everyday sounds that may be present in a virtual environment. Within each
class (solids, liquids, and gases), rectangles, rounded rectangles, and ellipses represent basic,
patterned, and compound sounds, respectively. Intersections between classes represent hybrid sounds.
Figure based on the taxonomy of everyday sounds by Gaver [34, Fig. 7]

At the next level, sounds are classified along layers of complexity, defined as 
follows. “Basic” sound-producing events are identified for solids, liquids, and gases: 
sounds made by vibrating solids may be caused by impacts, scraping, or other inter- 
actions; liquid sounds may be caused by discrete drips, or by more continuous splash- 
ing, rippling, or pouring events; and aerodynamic sounds may be made by discrete, 
sudden changes of pressure (explosions), or by more continuous introductions of 
pressure variations (gusts and wind). “Patterned” sounds are situated at a higher level 
of complexity, as they are produced through temporal patterning of basic events. As 
an example, walking, breaking, bouncing, and so on are all complex events involv- 
ing patterns of simpler impacts. Similarly, crumpling or crushing are examples of 
patterned deformation sounds. “Compound” sounds occupy the third level of com- 
plexity and involve more than one type of basic and patterned events. An example 
may be provided by the sound of a door slam, which involves the squeak of scraping 
hinges and the impact of the door on its frame, or a complex activity such as writing,
which involves irregular temporal patterns of both impacts and scrapes. Compound
sounds involve mutual constraints on their building components: as an example, 
concatenating the creak of a heavy door closing slowly with the slap of a light door 
slammed shut would arguably not sound natural. 
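
The idea of a patterned sound as a temporal arrangement of basic events can be sketched in a few lines. The example below assumes an idealized bouncing object: a point mass dropped from rest, with a constant coefficient of restitution. The function name and the parameter values are illustrative and not taken from the chapter. It computes when each impact occurs and how fast the object hits the ground; a sound model would then trigger one basic impact sound per event.

```python
import math

GRAVITY = 9.81  # m/s^2


def bounce_events(height, restitution, min_speed=0.05):
    """Return (time, impact_speed) pairs for an object dropped from `height` metres.

    After every impact the speed is multiplied by `restitution`
    (0 < restitution < 1), so the intervals between impacts shrink geometrically.
    """
    time = math.sqrt(2 * height / GRAVITY)    # free fall until the first impact
    speed = math.sqrt(2 * GRAVITY * height)   # speed at the first impact
    events = []
    while speed >= min_speed:
        events.append((time, speed))
        speed *= restitution                  # energy lost in the impact
        time += 2 * speed / GRAVITY           # up-and-down flight to the next impact
    return events


events = bounce_events(height=1.0, restitution=0.8)
intervals = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
```

Each interval is `restitution` times the previous one, which gives the accelerating rhythm typical of a bouncing object. The impact speeds could drive the amplitude of each basic impact sound.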

Finally, Gaver’s taxonomy also considers “hybrid” events, in which two or three 
types of material are involved. An example of a hybrid sound involving solids and 
liquids is the one produced by raindrops hitting a window glass, which involves 
attributes of both liquid and vibrating solid sounds. 

A taxonomy such as the one discussed here has at least two very attractive features. 
First, it provides a comprehensive framework for classifying any everyday sound 
potentially encountered in our world (and thus in a virtual world as well), with a fine 
level of detail. Secondly, its hierarchical structure provides a theoretical framework 
that can aid not only the sound design process but also the development of sound 
design tools. An example of an ecologically inspired software library for procedural 
sound design will be discussed in Sect. 2.5.3. 
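
As an illustration of how such a hierarchy might be mirrored in a sound design tool, the following sketch encodes the two axes of the taxonomy (material class and level of complexity) as Python types. All class and field names are hypothetical; only the categories and the examples come from the discussion above.

```python
from dataclasses import dataclass
from enum import Enum


class Material(Enum):
    SOLID = "vibrating solid"
    LIQUID = "liquid"
    GAS = "aerodynamic"


class Complexity(Enum):
    BASIC = 1      # a single sound-producing event
    PATTERNED = 2  # a temporal pattern of basic events
    COMPOUND = 3   # more than one type of basic and patterned events


@dataclass(frozen=True)
class SoundEvent:
    name: str
    materials: frozenset           # one or more Material members
    complexity: Complexity
    components: tuple = ()         # simpler SoundEvent objects it is built from

    @property
    def is_hybrid(self):
        """Hybrid events involve two or three types of material."""
        return len(self.materials) > 1


SOLID = frozenset({Material.SOLID})

impact = SoundEvent("impact", SOLID, Complexity.BASIC)
scrape = SoundEvent("scraping", SOLID, Complexity.BASIC)
walking = SoundEvent("walking", SOLID, Complexity.PATTERNED, (impact,))
door_slam = SoundEvent("door slam", SOLID, Complexity.COMPOUND, (scrape, impact))
# The complexity level of this hybrid example is chosen only for illustration.
rain_on_glass = SoundEvent(
    "raindrops on a window",
    frozenset({Material.SOLID, Material.LIQUID}),
    Complexity.PATTERNED,
)
```

A structure of this kind lets a tool enumerate, for instance, every compound sound that reuses a given basic event, so that a change to the impact model propagates to footsteps and door slams alike.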


2.3 Perceptual and Cognitive Aspects 


In this section, we critically review and discuss some relevant aspects related to 
the perception and cognition of sonic interactions and provide links between these 
aspects and central concepts of VR, such as the plausibility illusion, the place illusion, 
the sense of embodiment, and the sense of agency. Nordahl and Nilsson [57] also
consider how sound production and perception relate to plausibility illusion, place 
illusion, and the sense of body ownership, although from a somewhat different angle. 

Our main claim is that interactive sound sources in a virtual environment contribute 
in particular to the plausibility illusion, the sense of agency, and the sense of body 
ownership. In addition, our analysis of perceptual and cognitive aspects provides 
requirements and guidelines for the development and the implementation of sound 
models. 


2.3.1 Latency, Causality, and Multisensory Integration 


In any interactive system, latency and its associated jitter have a major perceptual 
impact. High latency or jitter may impair the user’s performance or, at least, provide 
a frustrating and tiring experience. Perceptually acceptable limits for latency and 
jitter in an interactive system should therefore be determined. However, such limits 
depend on several factors which are not easily disentangled. 

Characterizing latency and jitter in the sound rendering pipeline can be restated as 
a problem of perceived synchronization between pairs of events [46], which in turn 
may be divided into three categories: (i) an external and an internal temporal pattern 
(such as those occurring in a collaborative activity, e.g., music playing, between two
persons in a virtual environment); (ii) pairs of external events (which may or may
not pertain to the same sensory modality, such as pairs of sounds or a visual flash 
and a sound); (iii) actions of the user and their effects (e.g., the pressing of a button 
and the corresponding feedback sound). 

The latter case in particular is tightly connected to the definition of interactive 
sound adopted in this chapter. It is inherently a problem of multimodal synchroniza- 
tion, as it involves a form of extrinsic (auditory) feedback and a form of intrinsic (tac- 
tile, proprioceptive, and kinesthetic) feedback generated by the user’s action [53]. The 
complex interaction occurring between these modalities influences their perceived 
synchronization (and thus the acceptable latency). High latencies can deteriorate the 
quality of the interaction, impair the performance on a given task, and even disrupt the 
perceived link of causality between the user’s action and the resulting sonic outcome. 

The task at hand also influences the acceptable latency. As an example, it has 
been traditionally accepted that music performance is a task requiring extremely low 
(< 10 ms) latencies between the player’s actions and the response of a digital musical 
instrument [99]. Similarly, it has been shown that even small amounts of jitter can 
be detrimental to the perceived quality of the interaction [41]. In this respect, music 
provides a good “worst case” and a lower bound for latency in other, non-musical 
tasks, where various studies suggest that higher latencies may be acceptable or even 
unperceivable [43, 93]. 

The type of interaction must be considered as well. Impulsive interactions (either 
musical, such as playing a drum, or non-musical, such as knocking on a door) are 
likely to require lower latencies than continuous ones (bowing a violin string, or 
accompanying a closing door). As an example, it has been shown that the continuous 
interaction involved in playing a theremin allows for relatively high (> 30 ms) laten- 
cies, despite this being a musical task [54]. Finally, cognitive aspects also play a role: 
humans create expectations for the latency between their actions and the resulting 
feedback, detect disturbances to such expectations, and compensate for them. A study 
on the latency in live musical sound monitoring [48] showed significant discrepan- 
cies between different instruments, suggesting that certain players (e.g., pianists) are 
more tolerant to latency as they are accustomed to the inherent mechanical latency 
of their instrument, while others (e.g., drummers) are less so. 
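
To make these figures concrete, the sketch below estimates the latency contributed by audio buffering alone and compares it with the two values quoted above (10 ms for music performance, 30 ms for the continuous theremin case). The sample rate, buffer sizes, number of buffers, and the use of those two values as hard budgets are assumptions made for illustration. A real pipeline adds sensing, tracking, and output hardware delays on top.

```python
def buffer_latency_ms(buffer_size, sample_rate, num_buffers=2):
    """Latency in milliseconds introduced by `num_buffers` blocks of `buffer_size` frames."""
    return 1000.0 * num_buffers * buffer_size / sample_rate


# Values quoted in the text, used here as illustrative budgets.
BUDGETS_MS = {"music performance": 10.0, "continuous (theremin-like)": 30.0}

SAMPLE_RATE = 48000  # Hz; assumed

for buffer_size in (64, 256, 1024):
    latency = buffer_latency_ms(buffer_size, SAMPLE_RATE)
    verdict = {task: latency <= budget for task, budget in BUDGETS_MS.items()}
    print(f"{buffer_size:5d} frames -> {latency:6.2f} ms  {verdict}")
```

Under these assumptions, only the smallest buffer stays within the tighter budget, which is why impulsive, music-like interactions tend to dictate the buffer size of the whole sound rendering pipeline.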

We conclude this section with a hint at the second type of synchronization men- 
tioned at the beginning, i.e., that between pairs of external (possibly multimodal) 
events. Humans achieve robust perception through both the combination and the 
integration of information from multiple sensory modalities: the former strategy 
refers to interactions between non-redundant and complementary sensory signals 
aimed at disambiguating the sensory estimate, while the latter describes interactions 
between redundant signals aimed at reducing the variance in the sensory estimate 
and increasing its reliability [28]. The temporal relationships between inputs from 
different senses play an important role in multisensory combination and integra- 
tion, which can be realized only within a window of synchrony between different 
modalities (e.g., auditory and visual, or auditory and haptic feedback) where a single
percept is produced. Many studies [19, 83, 96] report quantitative results about
“integration windows” between modalities, which can be used as constraints for the
synchronization of the sound simulation pipeline with the visual (and possibly the
haptic) modality. For more details regarding these issues, please refer to Part IV in
this book, and in particular to Ch. 10.
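
One common formalization of integration as variance reduction is the inverse-variance (maximum-likelihood) weighting of two independent, unbiased estimates of the same quantity. This model is an assumption here rather than one spelled out in this chapter. The sketch below combines a hypothetical auditory and a hypothetical visual estimate of the direction of an event; the numbers are placeholders.

```python
def integrate(estimate_a, variance_a, estimate_b, variance_b):
    """Combine two independent estimates by inverse-variance weighting.

    Returns the combined estimate and its variance, which is never larger
    than the smaller of the two input variances.
    """
    weight_a = (1.0 / variance_a) / (1.0 / variance_a + 1.0 / variance_b)
    weight_b = 1.0 - weight_a
    combined = weight_a * estimate_a + weight_b * estimate_b
    combined_variance = (variance_a * variance_b) / (variance_a + variance_b)
    return combined, combined_variance


# Hypothetical azimuth estimates in degrees; vision is the more reliable cue here.
position, variance = integrate(
    estimate_a=12.0, variance_a=16.0,  # auditory estimate
    estimate_b=10.0, variance_b=4.0,   # visual estimate
)
```

The combined estimate is pulled toward the more reliable cue and has a lower variance than either input, which matches the description of integration as a strategy for increasing reliability.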


2.3.2 Everyday Listening and the Plausibility Illusion 


Human listeners are extremely good at interpreting sounds in terms of the events 
that produced them. The patterns of mechanical or aeroacoustic vibrations generated 
by sound-producing events depend on (and thus carry information about) contact 
forces, duration of contact, time-variations of the interaction, sizes, shapes, materials, 
and textures of the involved objects. We are immersed in a landscape of everyday 
sounds since the day we are born, and we have learned to extract meaning from this 
continuous and omnidirectional flow of information. 

Gaver [34] introduced the concept of everyday listening, as opposed to musical 
listening. When a listener hears a sound, she might concentrate on attributes like 
pitch, loudness, and timbre, or she might notice its masking effect on other sounds. 
These are examples of musical listening, meaning that the considered perceptual 
dimensions and attributes have to do with the sound itself, and are those used in the 
creation of music. On the other hand, the listener might concentrate on the char- 
acteristics of the sound source and possibly the surrounding environment. When 
hearing an approaching car, she might notice that the engine is powerful, that the 
car is approaching quickly from behind, or even that the road is a narrow alley with 
echoing walls on each side. This is an example of everyday listening. 

The two perceptual processes associated with musical and everyday listening cannot
be completely disentangled and may occur simultaneously. Still, the idea that in 
our everyday listening experience the physical characteristics of sound-producing 
objects can be linked to the corresponding acoustic features is a powerful one. The 
literature of ecological acoustics provides several quantitative results on such links. 
The underlying assumption is that the flow of acoustic energy reaching our ears, the 
acoustic array, contains specific patterns, or invariants, which the listener exploits 
to infer information about the environment and guide her action. These concepts and 
terminology originate in the framework of ecological perception, rooted in Gibson’s 
works on visual perception in the 1950s [35, 55]. 

Acoustic invariants associated with sound events may include several attributes of
a vibrating solid, such as its size, shape, and density, as these attributes contribute 
differently to characteristics of the resulting sound such as pitch, spectrum, amplitude 
envelope, and so on. In patterned sounds (see Sect. 2.2.2), the relevant information 
is also carried by the timing of successive events: footstep sounds must occur within
a range of rates and regularities in order to be perceived as walking; the regularity
in the temporal pattern of a bouncing sound provides information about the shape of
the object (e.g., a sphere versus a cube).


2 In this context, the label “ecological” is associated with two main concepts: first, perception is an
achievement of animal-environment systems, not simply animals, or their brains; second, the main
purpose of perception is to guide action, so a theory of perception cannot ignore what animals do.

The mapping between physical parameters and acoustic features is in general 
many-to-many. A single physical parameter can influence simultaneously many 
characteristics of the sound, and different physical parameters influence the same 
characteristics in different ways. As an example, changing the size of an object will 
scale the sound spectrum, i.e., will change the frequencies of the sound but not their 
pattern. On the other hand, changing the object’s shape results in a change in both the 
frequencies and their relationships. Acoustic invariants are thus the result of these 
complex patterns of change. Surveys of classic studies in ecological acoustics and 
acoustic invariants have been provided in previous works [5, 36]. 
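
The size example can be checked numerically. The sketch below assumes a hypothetical set of modal frequencies and the idealized rule that uniformly scaling an object by a factor s divides every modal frequency by s, as holds for geometrically similar objects made of the same material. The frequency values and the "reshaped" set are invented for illustration.

```python
def scale_size(modal_freqs, factor):
    """Modal frequencies of a geometrically similar object `factor` times larger."""
    return [f / factor for f in modal_freqs]


def ratios(modal_freqs):
    """Frequency pattern: each mode relative to the lowest one."""
    return [f / modal_freqs[0] for f in modal_freqs]


original = [220.0, 606.0, 1188.0]     # hypothetical modes, in Hz
doubled = scale_size(original, 2.0)   # same shape, twice the size
reshaped = [220.0, 440.0, 660.0]      # hypothetical modes of a different shape

# Scaling shifts every frequency but leaves the pattern of ratios unchanged.
same_pattern = all(
    abs(a - b) < 1e-9 for a, b in zip(ratios(original), ratios(doubled))
)
```

Here `doubled` has lower frequencies than `original` but the same ratios, while `reshaped` keeps the lowest frequency and changes the ratios. A listener could in principle exploit the first kind of change to judge size and the second to judge shape.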

The above discussion provides a solid theoretical framework to reason on the 
importance of ecologically valid acoustic information in eliciting the qualia of pres- 
ence [72] in an immersive VR system. Among the many definitions proposed in the 
literature, we follow Skarbez et al. [76] in defining presence broadly as “the per- 
ceived realness of a mediated or virtual experience”. Slater et al. [77] introduced the 
concepts of plausibility illusion and place illusion, to refer to two distinct subjective 
internal feelings, both of which contribute to eliciting the sense of presence in a 
subject experiencing an immersive VR scenario. This conceptual model of presence 
is depicted in Fig. 2.4.³

In this section we are particularly interested in the plausibility illusion, i.e., the 
illusion that the scenario being depicted is actually occurring (we will discuss the 
place illusion in Sect. 2.3.3 next). This is determined by the overall credibility of a 
virtual environment in comparison with subjective expectations. Slater argued that an 
important component of the plausibility illusion is “for the virtual reality to provide 
correlations between external events not directly caused by the participant and his/her 
own sensations” [77]. Skarbez et al. [76] proposed the construct of coherence, an 
objective characteristic of a virtual scenario that gives rise to the plausibility illusion 
(see Fig. 2.4, right) and depends on the internal logical and behavioral consistency of 
the virtual experience, with respect to prior knowledge. Building on these definitions, 
we argue that sound will contribute to the plausibility illusion of a virtual scenario as 
long as coherence is ensured for the auditory modality, i.e., as long as sound carries 
relevant ecological information expected by the user’s everyday listening experience. 

It shall be noted that coherence makes no assumptions about the high fidelity of 
a virtual environment to the real world. Consequently, the plausibility illusion “does 
not require physical realism” [77]: several studies show that virtual characters or 
objects displayed with low visual fidelity in the virtual environment do not disrupt 
the illusion. With regard to the auditory domain, this observation may be related to 
the concept of cartoon sounds [69], i.e., simplified descriptions of sounding phenomena
with exaggerated features. We argue that cartoon sounds do not disrupt the
plausibility illusion as long as they still carry relevant ecological information. This
is fundamentally the same principle exploited in the empirical science of Foley Art
for creating ecologically plausible sound effects [2].


3 Skarbez et al. [76] consider a third component, the social presence illusion, which we do not
address here.

Fig. 2.4 A conceptual model of presence: cloud boxes represent subjective internal feelings
(qualia), circles represent functions affected by individual differences, and rounded rectangles
represent objective characteristics of the virtual experience. Figure based on Skarbez [76, Fig. 2]


2.3.3 Active Perception, Place Illusion, Embodiment 


The “enactive” approach to experience posits that it is not possible to disassociate 
perception and action schematically and that every kind of perception is intrinsically 
active and thoughtful. One of the most influential contributions in this direction is due 
to Varela et al. [94]. In the authors’ conception, experience does not occur inside the 
perceiver, but rather it is enacted by the perceiver while exploring the environment. 
In this view, the subject of mental states is the embodied, environmentally situated 
perceiver. The term “embodied” highlights two main points: (i) perception depends
upon the kinds of experience that are generated from specific motor capabilities, 
and (ii) these capabilities are themselves embedded in a biological, psychological, 
and cultural context. Sensory and motor processes are fundamentally inseparable, 
and perception consists in exercising an exploratory skill. As an example [58], the 
sensation of softness experienced when holding a sponge consists in being aware 
that one can exercise certain skills: one can press the sponge, and it will yield under 
the pressure. The experience of the softness of the sponge is characterized by a 
variety of such possible patterns of interaction. Sensorimotor dependencies, or con- 
tingencies, are the laws that describe these interactions. When a perceiver knows that 
he is exercising the sensorimotor contingencies associated with softness, then he is 
experiencing the sensation of softness.

Embodied theories of perception provide the ground for discussing further central
concepts for VR, such as immersion, place illusion, sense of embodiment, and their 
relation to interactive sound. As depicted in Fig. 2.4 (left), immersion is an objective 
property of a VR system. Research has concentrated largely on characteristics such 
as latency, rendering frame rate, and tracking [22]. However, immersive systems can 
also be characterized in relation to the supported sensorimotor contingencies, which
in turn define a set of valid actions that are perceptually meaningful (for instance, 
with a head-mounted display and head-tracking, it is possible to turn your head or 
bend forward producing changes in the rendered visual images). When a system 
supports sensorimotor contingencies that approximate those of physical reality, it 
can give rise to the place illusion, a specific subjective internal feeling which is the 
illusion of being located inside the rendered virtual environment, of “being there” 
[77]. Whereas the plausibility illusion is based on what a subject perceives in the 
virtual environment, the place illusion is based on how she is able to perceive it. 

The great majority of studies addressing explicitly the effect of sound on the 
place illusion are concerned with spatial attributes: this is not entirely surprising, 
since many of these attributes are perceived by exercising specific motor actions 
(e.g., rotating the head to perceive the distance or the direction of a sound source 
or a reflecting surface). In this respect, directivity is possibly the only sound source 
attribute contributing to the place illusion, while other ecological attributes are more 
likely to contribute to the plausibility illusion only, as discussed in Sect. 2.3.2. In 
accordance with this picture, over the years, various authors [11, 38, 60] found that 
spatialized sound positively influences presence as being there when compared to no- 
sound or non-spatialized sound conditions, but does not affect the perceived realism 
of the environment. A comprehensive survey up to 2010 is provided by Larsson [47]. 

The sense of embodiment refers to yet another subjective internal feeling. Specif- 
ically, the sense of embodiment in an immersive virtual environment is concerned 
with the relationship between one’s self and one’s body, whereas the sense of pres- 
ence refers to the relationship between one’s self and the environment (and may 
occur even without the sensation of having a body). Kilteni et al. [45] provide a 
working definition of a sense of embodiment toward an artificial body, as the sense 
that emerges when that artificial body’s properties are processed as if they were the 
properties of one’s own biological body. Further, the authors associate it with three main 
components: (i) the sense of self-location, (ii) of body ownership, and (iii) of agency, 
the latter being investigated as an independent construct by other researchers [17]. 

The sense of self-location refers to one’s spatial experience of being inside a body, 
rather than being inside a world (with or without a body), and is highly determined by 
the visuospatial perspective, proprioception, and vestibular signals, as well as tactile 
sensations at the border between our body and the environment. The sense of body 
ownership refers to one’s self-attribution of an artificial body perceived as the source 
of the experienced sensations and emerges as a complex combination of afferent 
multisensory information and cognitive processes that may modulate the processing 
of sensory stimuli, as demonstrated by the well-known rubber hand illusion [13]. 
The sense of agency refers to the sense of having global motor control in relation 
to one’s own body and has been proposed to result from a comparison between the 
predicted and the actual sensory consequences of one’s actions [24]: when the two 
match by, for example, the presence of synchronous visuomotor correlations under 
active movement, one feels oneself to be the agent of those actions. 

The above discussion suggests that interactive sounds occurring directly in reac- 
tion to the avatar’s gestures in a virtual scenario, and coherently with the available 
sensorimotor contingencies, can positively affect the sense of agency in particular. 
One relevant example is provided by footsteps: several studies have addressed the 
issue of generating footstep sounds [14, 85, 95], although without assessing their 
specific relevance to the sense of agency. Other studies have shown that interactively 
generated sound can support haptic sensations, as in the case of impact sounds rein- 
forcing or modulating the perceived hardness of an impulsive contact [6], or friction 
sounds affecting the perceived effort in dragging an object [4] (refer to Chap. 12 
for other audio-haptic case studies). Yet, no attempt was made in these studies to 
specifically address the issue of agency. 

Even less research seems to have been conducted on the effects of interactive 
sound on the sense of body ownership. Footsteps provide a relevant example also 
in this case, as the sound of steps can be related to the perceived weight of one’s 
own body [85] or that of an avatar [74]. Sikström et al. [73] evaluated the role of 
self-produced sounds in participants’ sensation of ownership of virtual wings in an 
immersive scenario. A related issue is that of the sound of one’s own voice in a virtual 
environment [61]. 


2.4 Events Versus Processes 


Having discussed the perceptual and cognitive aspects involved in interactive sound 
generation, we now jump back to the pipeline of Fig. 2.1 and look specifically at the 
“source modeling” box. 

When creating sound sources in a virtual environment, approaches based on sam- 
ple playback are still the most common ones [12], taking advantage of sound design 
techniques that have been refined through a long history, and being able to yield 
perfect realism, “at least for single sounds triggered only once” [21]. From a 
completely different perspective, procedural approaches defer the generation of sound 
signals until runtime, when information on the sound-producing event is available and 
can be used to yield more interactive sonic results. This section discusses these two 
contrasting approaches. 


2.4.1 Event-Driven Approaches 


Approaches based on sample playback follow an event-driven logic, in which a 
specific sound exists as a waveform stored in a file or a table in memory and is 
bound to some event occurring in the virtual world. Borrowing an example from 
Farnell [31]: if (moves(gate)) play(scrape.wav). 

Fig. 2.5 Event-driven logics for VR sound design using samples and audio middleware software 

One immediate consequence of this is that the playback and the post-processing 
of samples are dissociated from the underlying physics engine and the graphical 
rendering engine. In the case of a sound played back once, the length of the sound is 
predetermined and thus any timing relationship between auditory and visual elements 
must also be predefined. In the case of a looped sound, the endpoint must be explicitly 
given, e.g., as a response to a subsequent event. More generally, the playback of 
sound is controlled by a finite and small set of states (such as in the case of an elevator 
that can be starting, moving, stopping, or stopped). Correspondingly, any event is 
bound to a sound “asset”, or to some post-processing of that asset. 
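
The event-driven logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the API of any middleware: the class EventDrivenSound, the asset file names, and the log list that stands in for calls to an audio engine are all hypothetical. The sketch shows how a small, finite set of elevator states is bound to assets, and why a looped asset needs a later event to end it.

```python
from dataclasses import dataclass, field

# Hypothetical assets: state -> (file name, looped?). Names are illustrative only.
ELEVATOR_ASSETS = {
    "starting": ("elevator_start.wav", False),
    "moving": ("elevator_hum.wav", True),
    "stopping": ("elevator_brake.wav", False),
    "stopped": (None, False),
}


@dataclass
class EventDrivenSound:
    """Binds a small, finite set of states to pre-recorded assets."""
    assets: dict
    state: str = "stopped"
    log: list = field(default_factory=list)  # stands in for audio engine calls

    def on_event(self, new_state: str) -> None:
        if new_state not in self.assets:
            raise ValueError(f"no asset bound to state {new_state!r}")
        old_asset, was_looped = self.assets[self.state]
        if was_looped:
            # A looped sound has no endpoint of its own: a later event must stop it.
            self.log.append(("stop_loop", old_asset))
        asset, looped = self.assets[new_state]
        if asset is not None:
            self.log.append(("loop" if looped else "play_once", asset))
        self.state = new_state


elevator = EventDrivenSound(ELEVATOR_ASSETS)
for event in ("starting", "moving", "stopping", "stopped"):
    elevator.on_event(event)
```

Note that nothing in this logic depends on the physics of the elevator: the duration of each one-shot asset is fixed in advance, which is exactly the dissociation from the physics engine discussed above.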

Current practices of sound design for VR are deeply and firmly rooted in such 
event-driven logics, depicted in Fig. 2.5. One clear example of this is provided by 
“audio middleware” software [12], which are tools that facilitate the work of the 
sound designer by reducing programming time and testing the sound design in real 
time along with the game engine. The most commonly adopted middleware solutions, 
such as FMOD Studio (Firelight Technologies; footnote 4) and Wwise (Audiokinetic; footnote 5), largely 
follow the traditional paradigm of DAWs (Digital Audio Workstations) and include 
GUIs for adding, controlling, and processing samples; linking them to objects, areas, 
and events of the virtual environment; and imposing rules for triggering and playback. 

One of the main acknowledged limitations of samples is that they are static, and 
they are just single, atomic instances of events. The repetitiveness involved in multiple 
playbacks of the same sounds has the potential to disrupt many of the perceptual and 
cognitive effects discussed in Sect. 2.3, and even to lead to fatigue. Partial remedies 
to this problem include the use of multiple samples for the same event, as well as the 
use of various post-processing operations, the most common being modifications to 
pitch, time, dynamics, as well as sound layering and looping [75]. 

4 https://www.fmod.com/. 
5 https://www.audiokinetic.com/products/wwise/. 

Well-established time-stretching and pitch-shifting algorithms exist; however, the 
quality of the processing is in general guaranteed only for relatively small shifting and 
stretching factors. Concerning dynamics, typical approaches are based on blending, 
cross-fading, and mixing of different samples, similarly to a musical sampler (and 
with similar limitations as well). Layering and looping are especially useful for the 
construction of ambiences: multiple sounds can be individually looped and played 
concurrently to create complex and layered ambiences. Repetitiveness can be reduced 
by assigning different lengths to different loops, and immersion can be enhanced by 
rendering individual layers at different virtual spatial locations. All this requires 
manual operations by the sound designer, such as splitting, cross-fading, and so on. 
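
The effect of assigning different lengths to different loops can be made concrete with a small calculation. As a sketch, assume loop lengths expressed in whole milliseconds and loops that all start together: the layered ambience repeats exactly only after the least common multiple of the individual lengths. The function name and the example lengths are illustrative assumptions.

```python
from math import lcm  # available from Python 3.9


def repeat_period_s(loop_lengths_ms):
    """Time in seconds after which all loops realign (lengths in whole ms)."""
    return lcm(*loop_lengths_ms) / 1000.0


# Three layers with identical lengths repeat as a whole every 8 s.
equal_loops = repeat_period_s([8000, 8000, 8000])

# Lengths of 7, 11, and 13 s share no common factor, so the combined
# texture repeats only after 7 * 11 * 13 = 1001 s.
staggered_loops = repeat_period_s([7000, 11000, 13000])
```

With roughly the same amount of audio material, the staggered layers go more than sixteen minutes before the whole texture recurs, although each individual layer still repeats at its own rate.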

Further countermeasures to repetition and listener fatigue include the use of tech- 
niques based on randomization. These can be applied to many aspects of sound, 
including, but not limited to (i) pitch and amplitude variations, (ii) sample selection, 
(iii) sample concatenation, (iv) looping, and (v) location of sound sources. As an 
example, randomized sample selection amounts to choosing at random among the 
alternative samples associated with the same event, e.g., a collision: a different sample 
is played back at each occurrence of the event, mimicking the differences occurring 
due to slightly different contact points and velocities. In randomized concatenation, 
different samples are concatenated to build a composite sound in response to a repet- 
itive sequence of events, such as in the case of footsteps, weapon sounds, and so on. 
Triggering different points with different probabilities can also be used to reduce the 
repetitiveness of looped layers in ambience sounds. The audio middleware solutions 
mentioned above typically implement several of these techniques. 
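
As an illustration of techniques (i) and (ii), the following sketch combines randomized sample selection with small random pitch and amplitude variations. It is a toy example and does not reproduce the behavior of any specific middleware: the class name, the file names, and the default ranges (one semitone, 3 dB) are assumptions chosen for illustration. The only design decision worth noting is that the sample just played is excluded from the next draw, so that the same recording is never heard twice in a row.

```python
import random


class RandomizedSampleEvent:
    """Picks a different sample at each trigger, with pitch and gain jitter."""

    def __init__(self, samples, pitch_range_semitones=1.0, gain_range_db=3.0, seed=None):
        if len(samples) < 2:
            raise ValueError("need at least two alternative samples")
        self.samples = list(samples)
        self.pitch_range = pitch_range_semitones
        self.gain_range = gain_range_db
        self.rng = random.Random(seed)
        self.last = None

    def trigger(self):
        # Exclude the sample played last time to avoid immediate repetition.
        candidates = [s for s in self.samples if s != self.last]
        sample = self.rng.choice(candidates)
        self.last = sample
        semitones = self.rng.uniform(-self.pitch_range, self.pitch_range)
        gain_db = self.rng.uniform(-self.gain_range, 0.0)
        return {
            "sample": sample,
            "playback_rate": 2.0 ** (semitones / 12.0),  # resampling ratio
            "gain": 10.0 ** (gain_db / 20.0),            # linear amplitude
        }


# Hypothetical footstep assets bound to the same "step" event.
footsteps = RandomizedSampleEvent(["step_a.wav", "step_b.wav", "step_c.wav"], seed=1)
sequence = [footsteps.trigger() for _ in range(8)]
```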

Randomization techniques hint at another issue with samples, which is the need 
for very large amounts of data. Putting together a large sample library is a slow and 
labor-intensive process. Moreover, data need to be stored in memory, possibly in 
secondary storage, from which they then have to be prefetched before playback. 


2.4.2 Procedural Approaches 


Techniques based on the randomization of several sample-processing parameters, 
such as those discussed above, are sometimes loosely referred to as procedural in the 
sound design practice [75, Chap. 2]. Here, we favor a stricter definition. In Farnell’s 
words [30], procedural audio is “sound as a process, rather than sound as data”. 
This definition shifts the focus onto the creation of audio assets, as opposed to the 
manipulation of existing ones. 

Procedural audio is thus synthetic sound, is real time, and most importantly is 
created according to a set of programmatic rules and live input. This implies that 
procedurally generated sound is synthesized at runtime, when all the needed input 
and contextual information are available, whereas in a sample-based approach, most 
of the work is performed offline prior to execution, implying that “many decisions 
are made in advance and cast in stone” [31]. 

Fig. 2.6 Procedural sound design: (a) model building and (b) method analysis stages (figures loosely 
based on Farnell [30, Figs. 16.4-5]) 

The stages involved in the process of procedural sound design may be loosely 
based on those of the software life cycle, including (i) requirements analysis, (ii) research 
and acquisition, (iii) model building, (iv) method analysis, (v) implementation, (vi) 
integration, (vii) test and iteration, and (viii) maintenance. Figure 2.6 provides a 
graphical summary of the two central stages, i.e., model building and method analysis. 

Building a model (Fig. 2.6a) provides a simplification of the properties and behav- 
iors of a real object, which starts from the analysis of sound data (including time- 
and/or frequency-domain analysis, extraction of relevant audio features, etc.), as well 
as a physical analysis of the involved sound-generating mechanisms, and results in 
a set of parametric controls and behaviors. The hierarchy of everyday sounds depicted 
in Fig. 2.3 provides a useful reference framework: the model at hand can be positioned 
inside this hierarchy. Moreover, following the discussion on everyday listening of 
Sect. 2.3.2, the choice of the model parametrization can be informed by the knowl- 
edge of relevant acoustic invariants carrying information about sound-generating 
objects and events. 

The method analysis stage is where the most appropriate sound synthesis and 
processing techniques are chosen, starting from a palette of available ones, and based 
on the model at hand. Figure 2.6b shows a set of commonly employed sound synthesis 
techniques (in Sect. 2.4.3, we will explore physics-based techniques in particular). 
As a result of this stage, an implementation plan is produced that includes a set of 
techniques and corresponding low-level synthesis parameters, as well as the involved 
audio streams. 

Based on this discussion, we can identify two main qualities of procedural 
approaches with respect to sample playback. The first one is their intrinsic adaptability 
and interactivity (according to the definitions given in Sect. 2.1), which derive 
from deferring sound generation to runtime based on programmatic rules and 
user input, and result in ever-changing sonic results in response to real-time control. 
The second one is flexibility, where a single procedural model can be parametrized 
to produce a variety of sound events within a given class of sounds: this contrasts 
with sample-based, event-driven approaches, where ever-increasing amounts of data 
and assets are needed in order to cope with the needs of complex virtual worlds. 


2.4.3 Physics-Based Methods 


Looking back at Fig. 2.6b, one of the available paints in the palette of sound synthesis 
techniques is that of physics-based methods. 

The boundaries between what can be considered physical (or physics-based, or 
physics-informed) sound synthesis are somewhat blurry in the scientific literature. 
Here, we adopt the definition given by Smith [78] and refer to synthesis techniques 
where “ [...] there is an explicit representation of the relevant physical state of the 
sound source. For example, a string physical model must offer the possibility of 
exciting the string at any point along its length. [...] All we need is Newton.” The 
last claim refers to the idea that physical modeling always starts with a quantitative 
description of the sound sources based on Newtonian mechanics. Such description 
may be approximate and simplified to various extents, but the above definition pro- 
vides an unambiguous—albeit broad—characterization in terms of physical state 
access. Resorting to a simple (yet historically relevant [68]) example, we can say 
that additive synthesis of bell sounds is not physics-based, as additive sinusoidal 
partials only describe the time-frequency characteristics of the sound signal without 
any reference to the physical state of the bell. On the other hand, modal synthesis [1] 
of the same bell, with modal oscillators tuned to the sound partials, is only apparently 
a similar approach: a linear combination of the modes can provide the displacement 
and the velocity at any point of the bell, and each modal shape defines to what extent 
an external force applied at a given point affects the corresponding mode. 
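
The role of modal shapes can be illustrated with a minimal sketch. Instead of a bell, whose modes require measurement or numerical analysis, we assume an ideal string of unit length fixed at both ends, for which mode k has shape sin(k pi x) and frequency k times the fundamental. The decay rule (higher modes decay faster), the parameter values, and all function names are assumptions made for illustration. The point is that the strike position changes the gain of each mode, something a purely signal-based additive model has no way to express.

```python
import math


def string_modes(f0=110.0, n_modes=8, decay=3.0):
    """Modes of an ideal string of unit length, fixed at both ends (toy object)."""
    # Assumption: decay rate grows linearly with mode number.
    return [{"k": k, "freq": k * f0, "decay": decay * k} for k in range(1, n_modes + 1)]


def mode_shape(k, x):
    """Displacement of mode k at position x in [0, 1]."""
    return math.sin(k * math.pi * x)


def modal_gains(modes, x_strike, x_listen):
    """How strongly a strike at x_strike drives each mode, observed at x_listen."""
    return [mode_shape(m["k"], x_strike) * mode_shape(m["k"], x_listen) for m in modes]


def synthesize(modes, gains, duration=0.5, sr=8000):
    """Impulse response as a sum of exponentially decaying sinusoids."""
    out = []
    for n in range(int(duration * sr)):
        t = n / sr
        out.append(sum(
            g * math.exp(-m["decay"] * t) * math.sin(2 * math.pi * m["freq"] * t)
            for m, g in zip(modes, gains)
        ))
    return out


modes = string_modes()
gains_mid = modal_gains(modes, x_strike=0.5, x_listen=0.3)   # even modes vanish
gains_edge = modal_gains(modes, x_strike=0.1, x_listen=0.3)  # all eight modes excited
signal = synthesize(modes, gains_mid)
```

Striking the string at its midpoint places the excitation on a node of every even mode, so those partials disappear from the sound, whereas a strike near the end excites all of them.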

The history of physics-based synthesis is rooted in studies on the acoustics of the 
vocal apparatus [44] and of musical instruments [39, 40], where numerical models 
were initially used for simulation rather than synthesis purposes. Current techniques 
are based on several alternative formulations and methods, including ordinary or par- 
tial differential equations, equivalent circuit representations, modal representations, 
finite-difference and finite-element schemes, and so on [78]. Comprehensive surveys 
of physical modeling approaches have been published [79, 89]. Although these deal 
with musical sound synthesis mostly, much of what has been learned in that domain 
can be applied to the physical modeling of any sounding object. 

Although physics-based synthesis is sometimes made synonymous with procedural 
audio, Fig. 2.6b provides a clear picture of the relation between the two. In this 
perspective, “procedural audio is more than physical modeling,” [31] and the latter 
can be seen as one of the tools at the disposal of the sound designer to reduce a sound 
to its behavioral realization. Combining physics-based approaches with knowledge 
of auditory perception and cognition often results in procedural models in which the 
physical description has been drastically simplified while retaining the ecological 
validity of sounds and the realism of the interactions, thus preserving the plausibility 
illusion of the resulting sonic world and the sense of agency of the subject (see related 
discussions in Sects. 2.3.2 and 2.3.3). 


2.4.4 Computational Costs 


Event-driven and procedural approaches must be analyzed also in terms of the 
involved computational requirements. In case of insufficient resources, excessive 
computational costs may introduce artifacts in the rendered sound or, alternatively, 
may require increasing the overall latency of the rendering up to a point where the 
perception of causality and multisensory integration are disrupted (see Sect. 2.3.1). 

With reference to Fig. 2.1, it can be stated that one main computational bottleneck 
in the sound simulation rendering pipeline [51] is the “per sound source” cost. This 
relates in particular to the sound propagation stage (see Chap. 3), as reflections, 
scattering, occlusions, Doppler effects, and so on must be computed for each sound 
source involved in the simulation. But it also includes the source modeling stage, 
with particular reference to the generation of the sound source signals. 

Sample playback has a fixed cost, irrespective of the sound being played. More- 
over, the cost of playback is very small. However, samples must be loaded in memory 
before being played. As a consequence, when a sound is triggered, the playback may 
involve a prefetch phase where a soundbank is loaded from the secondary memory. 
Moreover, some management of polyphony must be set in place in order to pri- 
oritize the playback in case of several simultaneously active sounds. This can use 
policies similar to those employed in music synthesizers: typically, sounds falling 
below a certain amplitude threshold are dropped, leaving place for other sounds. The 
underlying assumption is that louder sounds mask softer ones, so that dropping the 
latter has no or minimal perceptual consequences. Although modern architectures 
allow for the simultaneous playback of hundreds of audio assets, generating complex 
soundscapes may exceed the amount of available channels. 
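
A polyphony policy of the kind described above can be sketched as follows. This is a simplified illustration, not the policy of any particular engine: the function name, the threshold of -60 dB, and the scene contents are assumptions. Sounds below the threshold are dropped first; if the remaining ones still exceed the number of available voices, the loudest are kept.

```python
def allocate_voices(active_sounds, max_voices, threshold_db=-60.0):
    """Split (name, level_db) pairs into the names to keep and the names to drop."""
    audible = [s for s in active_sounds if s[1] >= threshold_db]
    too_quiet = [s for s in active_sounds if s[1] < threshold_db]
    # Loudest first: louder sounds are assumed to mask softer ones.
    ranked = sorted(audible, key=lambda s: s[1], reverse=True)
    kept = ranked[:max_voices]
    dropped = ranked[max_voices:] + too_quiet
    return [name for name, _ in kept], [name for name, _ in dropped]


# Hypothetical scene with five active sounds and only three free voices.
scene = [
    ("footstep", -12.0),
    ("wind", -30.0),
    ("distant_bird", -65.0),
    ("door", -6.0),
    ("clock", -40.0),
]
kept, dropped = allocate_voices(scene, max_voices=3)
```

Here the bird falls below the threshold, and the clock, although audible, is sacrificed because three louder sounds compete for the available voices.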

On the other hand, procedural sound has variable costs, which depend on the 
complexity of the corresponding model and on the employed methods. This is par- 
ticularly evident in the case of physics-based techniques: for large-scale, brute-force 
approaches, like higher dimensional finite-element or finite-difference methods, real 
time is still hard to achieve. On the other hand, techniques like modal synthesis can 
be implemented very efficiently, albeit at the cost of reduced flexibility of the models 
(e.g., interaction with sounding objects limited to single input-output), which in turn 
can have a detrimental effect on the plausibility illusion. Some non-physical methods 
are very cheap in terms of computational requirements, as in the case of subtractive 
synthesis for generating wind or fire sounds. Section 2.5.1 provides several examples 
of procedural methods for various classes of everyday sounds. 

Although it is generally true that sample-based methods outperform procedural 
audio for small numbers of sounds, it has been noted [30] that this is not necessarily 
true in the limit of larger numbers: whereas the fixed cost of sample playback results 
in a computational complexity that is linear in the number of rendered sources, the 
availability of very cheap procedural models means that, for high numbers of 
sources, the situation can reverse and procedural sound can start to outperform 
sample-based methods. 


2.5 Procedural and Physics-Based Approaches in VR Audio 


Given these premises, what is the current development of procedural and physics- 
based approaches in audio for VR? In this section, we show that, despite a substantial 
amount of research, these approaches are still struggling to gain popularity in real- 
world products and practices. 


2.5.1 Methods 


Far from providing a comprehensive survey of previous literature in the field, which 
would go way beyond the scope of this chapter, this section aims at assessing to what 
extent the taxonomy of everyday sounds provided in Fig. 2.3 has been covered by 
existing procedural approaches. This exercise also serves as a testbed to verify the 
generality of that taxonomy. For a recent and broad survey, see Liu and Manocha [51]. 

Solid sounds are by far the most investigated category. For basic models, modal 
synthesis [1] is the dominant approach. There are several works investigating the 
use of modal methods for the procedural generation of contact sounds between solid 
objects, including the optimization of modal synthesis through specialized numer- 
ical schemes and/or perceptual criteria, as in the work by Raghuvanshi et al. [63]. 
Procedural models of surface textures have been proposed by several scholars [66, 
91] and applied to scraping and rolling sounds [64]. Basic interaction forces (impact 
and sliding friction) can be modeled with a variety of approaches that range from 
qualitative approximations of temporal profiles of impulsive force magnitudes [92] 
to the physical simulation of stick-slip phenomena in friction forces [7]. 

At the next level of complexity, models of patterned solid sounds have also been 
widely studied. Stochastic models of crumpling phenomena have been proposed, 
with applications to cloth sound synthesis [3], crumpling paper sounds, or sounds 
produced by deformations of aggregate materials, such as sand, snow, or gravel [15]. 
The latter have also been used in the context of walking interactions [81] (see also 
Sect. 2.3.3) to simulate the sound of a footstep on aggregate ground. Breaking 
sounds have been modeled especially with the purpose of synchronizing animations 
of brittle fractures produced by a physics engine [59, 100]. 

The category of aerodynamic sounds is less studied. Within the basic level of 
complexity, the sounds produced by wind include those resulting from interaction 
with large static obstructions, aeolian tones, and cavity tones: these have been pro- 
cedurally modeled with techniques ranging from computationally intensive fluid- 
dynamics simulations [26] to simple (yet efficient and effective) subtractive schemes 
using noisy sources and filters [30]. These can be straightforwardly employed to 
construct patterned and compound sonic events, including windy scenes, swinging 
objects, and so on [71]. Other basic aeroacoustic events include turbulences, most 
notably explosions, which are a key component of more complex sounds such as gun- 
shots [37] and fire [18]. Yet another relevant patterned sonic event is that produced 
by combustion engines [10]. 
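
To give an idea of how simple a subtractive scheme "using noisy sources and filters" can be, the following sketch generates a wind-like signal by passing white noise through a resonant band-pass filter whose center frequency and output level follow a slow gust profile. It is not Farnell's model: the sinusoidal gust profile, the frequency range (300 to 900 Hz), the Q factor, and the use of a Chamberlin state-variable filter are all assumptions made to keep the example short.

```python
import math
import random


def wind(duration=1.0, sr=8000, gust_rate_hz=0.5, q_factor=5.0, seed=0):
    """White noise through a slowly swept band-pass filter, normalized to peak 1."""
    rng = random.Random(seed)
    low = band = 0.0
    damping = 1.0 / q_factor
    out = []
    for n in range(int(duration * sr)):
        t = n / sr
        # Slow gust profile between 0 and 1 (assumed sinusoidal for simplicity).
        gust = 0.5 * (1.0 + math.sin(2 * math.pi * gust_rate_hz * t))
        center_hz = 300.0 + 600.0 * gust  # the whistling pitch rises with the gust
        f = 2.0 * math.sin(math.pi * center_hz / sr)
        noise = rng.uniform(-1.0, 1.0)    # the noisy source
        # Chamberlin state-variable filter: `band` is the band-pass output.
        low += f * band
        high = noise - low - damping * band
        band += f * high
        out.append((0.3 + 0.7 * gust) * band)  # louder during gusts
    peak = max(abs(v) for v in out)
    return [v / peak for v in out]


signal = wind()
```

The per-sample cost is a handful of multiplications and additions, which is consistent with the remark in Sect. 2.4.4 that some non-physical methods are very cheap.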

Liquid sounds appear to be the least addressed category. Basic procedural models 
include sounds produced by drops in a liquid [90] or by pouring a liquid [65], whereas 
patterned and compound sonic events have been more often simulated using concate- 
native approaches relying on the output of the graphical procedural simulation [98]. 
A relevant example of hybrid solid-liquid sounds is that of rain [50]. 


2.5.2 Optimizations 


We have provided in Sect. 2.4.4 a general discussion on the computational costs 
associated with procedural approaches, in comparison to sample-based methods. Since the 
former typically result in higher “per sound source” costs than the latter, various 
studies have proposed strategies for reducing the load of complex procedural audio 
scenes in virtual environments. 

One attractive feature of procedural sound in terms of computational complexity is 
the possibility of dynamically adapting the level of detail (LOD) of the synthesized 
audio. The concept of LOD is a long-established one in computer graphics and 
encompasses various optimization techniques for decreasing the complexity of 3D 
object rendering [52]. The general goal of LOD techniques is to increase the rendering 
speed by reducing details while minimizing the perceived degradation of quality. 
Most commonly, the LOD is varied as a function of the distance from the camera, 
but other metrics can be used, including size, speed of motion, priority, and so on. 
Reducing the LOD may be achieved by simplifying the 3D object mesh, or by 
using impostors (i.e., replacing mesh-based with image-based rendering), and other 
approaches can be used to dynamically control the LOD of landscape rendering, 
crowd simulation, and so on. 

Similar ideas may be applied to procedural sound, achieving further reductions 
of computational costs for complex sound scenes with respect to sample playback. 
However, very few studies explored the concept of LOD in the auditory domain, 
and there is not even a commonly accepted definition in the related literature: some 
scholars have coined the term Sound Level Of Detail (SLOD) [70], while others use 
Level Of Audio Detail (LOAD) [27], both generically referring to varying sound 
resolution according to the required perceived precision. Here, we stick to the latter 
definition (LOAD), since this seems to be more frequently adopted in recent literature. 

Fig. 2.7 Example of dynamic LOAD based on the radial distance from the listener, where levels 
of detail are associated with three overlapping proximity profiles. Figure partly based on Schwarz 
et al. [70, Fig. 3] 

Strategies for dynamic LOAD can be partly derived from graphics. Simple 
approaches amount to fading out and turning off distant sounds based on radial distance 
or zoning. Depending on their distance, sound sources may be also clustered or acti- 
vated according to some predefined behavior. Techniques based on impostors can be 
used as well: as an example, when rendering the sound of a crowd, individual sounds 
emitted by several characters can be replaced by a global sample-based ambience 
sound. However, one should be aware of the differences between visual and audi- 
tory perception and exploit the peculiarities of the latter to develop more advanced 
strategies for dynamic LOAD. Figure 2.7 depicts an example of a dynamic LOAD 
strategy based on radial distance, in which levels of detail are associated with three 
overlapping proximity profiles around the listener (foreground, middle ground, and 
background): sounds in the foreground are rendered individually through procedural 
approaches; those that fall into the middle ground can be rendered through some 
simplifying approaches (clustering, grouping, and statistical behaviors); and finally, 
sounds in the background may be substituted by audio impostors such as audio files. 
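
The three overlapping proximity profiles can be sketched as a set of distance-dependent mixing weights. The linear cross-fades and the specific radii below are assumptions made for illustration; the source does not prescribe them. The sketch also assumes that the first cross-fade ends before the second begins, so that the three weights are non-negative and always sum to one.

```python
def crossfade(distance, start, end):
    """0 before `start`, 1 after `end`, linear in between."""
    if distance <= start:
        return 0.0
    if distance >= end:
        return 1.0
    return (distance - start) / (end - start)


def load_weights(distance, fg_to_mid=(5.0, 10.0), mid_to_bg=(25.0, 40.0)):
    """Mixing weights of the three levels of audio detail at a given distance.

    Assumes fg_to_mid[1] <= mid_to_bg[0]; radii are in arbitrary scene units.
    """
    to_mid = crossfade(distance, *fg_to_mid)
    to_bg = crossfade(distance, *mid_to_bg)
    return {
        "foreground": 1.0 - to_mid,       # individually synthesized sources
        "middle_ground": to_mid - to_bg,  # clustered or statistical rendering
        "background": to_bg,              # audio impostors
    }


near = load_weights(2.0)
in_transition = load_weights(7.5)
far = load_weights(50.0)
```

In the overlap regions two renderings are active at once, which costs more than a hard switch but avoids audible discontinuities when a source crosses a boundary.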

Pioneering work in this direction was carried out by Fouad et al. [32], although 
the authors did not explicitly refer to the concept of LOD. This work proposes a 
set of “perceptually based scheduling algorithms” that allow a scheduler to assign 
execution time to each sound in the scene while minimizing some perceptually motivated 
error metric. In particular, sounds are prioritized depending on the listener’s gaze, 
the loudness, and the age of the sound. Tsingos and coworkers [56, 88] proposed 
an approach to reduce the number of (sample-based) sound sources in a complex 
scenario, by combining perceptual culling and perceptual clustering. The culling 
stage removes perceptually inaudible sources based on a global masking model, while 
the clustering stage groups the remaining sound sources into a predefined number of 
clusters: as a result, a representative point source is constructed for each cluster and 
a set of equivalent source signals is generated. Schwarz et al. [70] proposed a design 
with three LOADs based on proximity and smooth transitions between proximity 
levels, very much like those depicted in Fig. 2.7: (i) foreground, i.e., individually 
driven sound events (e.g., individual raindrops on tree leaves); (ii) middle ground, 
i.e., group-driven sound events, at the point where individual events cannot be isolated 
and can be replaced by stochastic behaviors; (iii) background, i.e., sound sources 
that are further away and can be rendered by audio impostors such as audio files 
or dynamic mixing of groups of procedural impostors. More recently, Dall’Avanzi 
et al. [23] analyzed how soundscapes with two applied LOADs affect player 
immersion. Two groups of participants played two different versions of the 
same game, and player immersion was measured through two questionnaires. 
However, results in this case showed no considerable difference between the two 
groups. 

Other researchers proposed or evaluated LOAD techniques specifically tailored 
to certain synthesis methods. Raghuvanshi et al. [63] addressed modal synthesis and 
investigated various perceptually motivated techniques for improving the efficiency 
of the synthesis. These include a “quality scaling” technique that effectively controls 
the dynamic LOAD: briefly, in a scene involving many sounding objects, the number 
of modes assigned to individual objects scales with the objects’ location from foreground 
to background, without significant losses in perceived quality. Durr et al. [27] evalu- 
ated through subjective tests various procedural models of sound sources with three 
applied LOADs. Specifically, three procedural models proposed by Farnell [30] (see 
also Sect. 2.5.1) were chosen for investigation: (i) fire sounds employ subtractive 
synthesis to generate and combine hissing, crackling, and lapping features; (ii) bubble 
sounds use a form of additive synthesis with frequency- and amplitude-controlled 
sinusoidal components representing single bubbles; (iii) wind sounds are again pro- 
duced using subtractive synthesis (amplitude-modulated noise and various filtering 
elements to represent different wind effects). A different approach to applying LOAD 
was implemented for each model. Correspondingly, listening tests provided different 
results for each model in terms of perceived quality at different LOADs. 
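
The idea of scaling the number of modes with the position of an object between foreground and background can be sketched as follows. This is not the algorithm of Raghuvanshi et al. [63], which relies on perceptual criteria; it is a toy rule assuming a linear reduction with distance, with hypothetical near and far radii and a hypothetical minimum of two modes.

```python
def modes_for_distance(distance, full_modes, near=2.0, far=30.0, min_modes=2):
    """Number of modes to synthesize for an object at a given distance.

    Assumption: linear interpolation between the full mode count (at `near`
    or closer) and `min_modes` (at `far` or beyond).
    """
    floor = min(min_modes, full_modes)
    if distance <= near:
        return full_modes
    if distance >= far:
        return floor
    fraction = (far - distance) / (far - near)
    return max(floor, round(floor + fraction * (full_modes - floor)))


# A hypothetical object with 40 modes, rendered at increasing distances.
budget = [modes_for_distance(d, full_modes=40) for d in (1.0, 16.0, 30.0)]
```

Since the cost of a modal synthesizer grows with the number of active modes, such a rule trades detail for computation where the loss is least likely to be noticed.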

The reader interested in further discussion about audio quality should also refer 
to Chap. 5. 


2.5.3 Tools 


In spite of all the valuable research results produced so far, there is still a lack of 
software tools that assist the sound designer in using procedural approaches. 

Designers working with procedural audio use a variety of audio programming 
environments. Popular choices include (but are not limited to) Pure Data, 
Max/MSP, or Csound. The first two in particular implement a common, 
dataflow-oriented paradigm [62] and use a visual patch language where “the diagram is the 
program”: Farnell [31] argues that this paradigm is particularly suited for proce- 
dural audio as it has a natural congruence with the abstract model resulting from 
the design process. On the other hand, integrating these environments into the most 
widespread gaming/VR engines is not straightforward: at the time of writing, some 
active open-source projects include libpd [16], a C library that turns Pure Data into an 
embeddable audio synthesis library and provides wrappers for a range of languages, 
and Cabbage [97], a framework for developing audio plugins in Csound, includ- 
ing plugins for the FMOD middleware. Commercial gaming/VR engines typically 
provide limited functionalities to support procedural sound design, although some 
recent developments may hint at an ongoing change of perspective: as an example, 
the Blueprint visual scripting system within the Unreal Engine has been used for 
dataflow-oriented procedural audio programming, also using some native synthesis 
(subtractive, etc.) capabilities. 

All of the tools mentioned above still require working at a low level of abstraction,
implying that the sound designer must have the technical skills needed to deal with 
low-level synthesis methods and parameters, and at the same time limiting produc- 
tivity. There is a clear need for tools that allow the designer to work at higher levels 
of abstraction. One instructive example is provided by the Sound Design Toolkit 
(SDT), an open-source software package developed over several years [9, 25] which 
provides a set of sound models for the interactive generation of several acoustic phe- 
nomena. In its current embodiment, SDT is composed of a core C library exposing 
an API, plus a set of wrappers for Max and Pure Data, and a related collection of 
patches and help files. Interestingly, the collection is based on a hierarchical taxon- 
omy of everyday sound events which follows very closely the one depicted in Fig. 2.3 
and implements a rich subset of its items. The designer has access to both low-level 
parameters (e.g., the modal frequencies of a basic solid resonator) and to high-level 
ones (e.g., the initial height of a bouncing object). 

Commercial products facilitating the designer’s workflow are also far from abundant:
Lesound (formerly Audiogaming) sells a set of plugins for FMOD and Wwise
that include procedural simulations of wind, rain, motor, and weather sounds, while
Audiokinetic (developer of Wwise) develops the SoundSeed plugin series,
which includes procedural generation of wind and whooshing sounds as well as
impact sounds. Nemisindo provides a web-based platform for real-time synthesis
and manipulation of procedural audio, which stems from the FXive academic
project [8], but no plugin-based integration with VR engines or audio middleware
software is available at the time of writing.

A much-needed facilitating tool for the sound designer is one that automates part 
of the design process, allowing in particular for automatic tuning of the parameters of 
a procedural model starting from a target (e.g., recorded) sound. This would provide 
a means to recreate procedurally a desired sound and more in general to ease the 
design by providing a starting set of parameter values that can be further edited. 


Lesound: https://lesound.io/.

In the context of modal synthesis, various authors have proposed automatic analysis
approaches for determining modal parameters from a target signal (e.g., an impact
sound). In this case, the parametrization of the model is relatively simple: every mode 
at a given position is fully characterized by a triplet of scalars representing its fre- 
quency, decay coefficient, and gain. This generalizes to an array of gains if multiple 
points on the object are considered, or to continuous modal shapes as functions of 
spatial coordinates on the object. Ren et al. [67] proposed a method that extracts
perceptually salient features from audio examples, together with a parameter estimation
algorithm that searches for the best material parameters for modal synthesis. Based on this
work, Sterling et al. [82] added a probabilistic model for the damping parameters 
in order to reduce the effect of external factors (object support, background noise, 
etc.) and non-linearities on the estimate of damping. Tiraboschi et al. [87] also pre- 
sented an approach to the automatic estimation of modal parameters based on a target 
sound, which employs a spectral modeling algorithm to track energy envelopes of 
detected sinusoidal components and then performs linear regression to estimate the 
corresponding modal parameters. 
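
The per-mode triplet (frequency, decay coefficient, gain) and the regression step mentioned above can be illustrated with a short sketch. It assumes that the amplitude envelope of a single mode has already been tracked, and that the fit is done on the log-envelope, which turns an exponential decay into a straight line. The function names and this particular formulation are assumptions for illustration, not the algorithm of [87].

```python
import math


def synth_mode(freq, decay, gain, sr=8000, dur=0.5):
    """One mode as a damped sinusoid: gain * exp(-decay * t) * sin(2 pi freq t)."""
    n = int(sr * dur)
    return [gain * math.exp(-decay * k / sr) * math.sin(2 * math.pi * freq * k / sr)
            for k in range(n)]


def estimate_decay_gain(times, envelope):
    """Least-squares line through the log-envelope: log a(t) = log(gain) - decay * t."""
    ys = [math.log(a) for a in envelope]
    n = len(times)
    mean_t, mean_y = sum(times) / n, sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
             / sum((t - mean_t) ** 2 for t in times))
    intercept = mean_y - slope * mean_t
    return -slope, math.exp(intercept)


# Synthetic, noise-free envelope with known parameters (assumed already tracked).
true_decay, true_gain = 12.0, 0.8
times = [0.01 * k for k in range(20)]
envelope = [true_gain * math.exp(-true_decay * t) for t in times]
decay, gain = estimate_decay_gain(times, envelope)
samples = synth_mode(440.0, decay, gain)
```

On real recordings the envelope is noisy and modes overlap, which is where the probabilistic and perceptual refinements discussed above come in.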

While the case of solid objects and modal synthesis is a relatively simple one, 
the issue of automatic parameter estimation has been largely disregarded for other 
classes of sounds and models. 


2.6 Conclusions 


Our discussion in this chapter has hopefully shown that procedural approaches 
offer extensive possibilities for designing sonic interactions in virtual environments. 
And yet as of today the number of real-world applications and tools utilizing these 
approaches is very limited. In fact, not much has changed since ten or fifteen years 
ago, when other researchers observed a similar lack of interest from the industry [12, 
29], with the same technical and cultural obstacles to adoption still in place. In a way,
recent technological developments have further favored the use of sample-based 
approaches: in particular, decreasing costs of RAM and secondary storage, as well 
as optimized strategies to manage caching and prefetching of sound assets, have 
made it possible to store ever larger amounts of data. This state of affairs mimics 
closely what happened in the music industry during the last three decades: physics- 
based techniques in particular have been around for a long time, but the higher sound 
quality and accuracy of samples are still preferred over the flexibility of physical 
models for the emulation of musical instruments. 

Perhaps then the question is not whether procedural approaches can overcome 
sample-based audio, but when, i.e., under what specific circumstances. In this chapter, 
we have provided some elements, particularly links to a number of relevant percep- 
tual and cognitive aspects, such as the plausibility and place illusions, the sense of 
embodiment, and the sense of agency. We argue that procedural audio can compete 
with samples in cases where either (i) very large amounts of data are needed to minimize
repetition and support the plausibility illusion, or (ii) interactivity is needed
beyond event-driven logic, in order to provide tight synchronization and plausible
variations with user actions, and to support the user’s sense of agency and body ownership.

One example of the first circumstance is provided by wind sounds: good record- 
ings of real wind effects are technically difficult to come by and long recordings are 
required to create convincing ambiences of windy scenes using looping, while on 
the other hand procedurally generated wind sounds achieve high levels of realism. 
It is therefore no surprise that the few commercially available tools for procedural 
sound all include wind (see Sect. 2.5.3) and have been successfully employed also in 
large productions. While wind falls in the category of adaptive, rather than interactive,
sounds, two relevant examples for the second circumstance may be provided by
footsteps and sliding friction (bike braking, hinges squeaking, rubbing, etc.): besides
requiring large amounts of data and randomization to avoid repetition, these sounds
arise in response to complex and continuous motor actions by the user, which cannot
be fully captured by event-driven logic.

Future research and development should therefore focus on cases where proce- 
dural models can compete with samples, looking more deeply into the effects on the 
plausibility illusion, sense of agency, and sense of body ownership. From a more tech- 
nical perspective, promising directions for future research include the development 
of dynamic LOAD techniques, as well as high-level authoring tools and automation.


Chapter 3
Interactive and Immersive Auralization


Abstract Real-time auralization is essential in virtual reality (VR), gaming, and
architecture to enable an immersive audio-visual experience. The audio rendering 
must be congruent with visual feedback and respond with minimal delay to interactive 
events and user motion. The wave nature of sound poses critical challenges for 
plausible and immersive rendering and leads to enormous computational costs. These 
costs have only increased as virtual scenes have progressed away from enclosures 
toward complex, city-scale scenes that mix indoor and outdoor areas. However, hard 
real-time constraints must be obeyed while supporting numerous dynamic sound 
sources, frequently within a tightly limited computational budget. In this chapter, we 
provide a general overview of VR auralization systems and approaches that allow 
them to meet such stringent requirements. We focus on the mathematical foundation, 
perceptual considerations, and application-specific design requirements of practical 
systems today, and the future challenges that remain. 


3.1 Introduction 


Audition and vision are unique among our senses: they perceive propagating waves. 
As a result, they bring us detailed information not only of our immediate surroundings
but of the world much beyond as well. Imagine talking to a friend in a cafe: the door is
open, and outside is a bustling city intersection. While touch and smell give a detailed 
sense of our immediate surroundings, sight and sound tell us we are conversing with a 
friend, surrounded by other people in the cafe, immersed in a city, its sounds streaming 
in through the door. Virtual reality ultimately aims to re-create this sense of presence 
and immersion in a virtual environment, enabling a vast array of applications for 
society, ranging from entertainment to architecture and social interaction without 
the constraints of distance.


Rendering. To reproduce the audio-visual experience given in the example above,
one requires a dynamic, digital 3D simulation of the world describing how both light 
and sound would be radiated, propagated, and perceived by an observer immersed 
in the computed virtual fields of light and sound. The world model usually takes 
the form of a 3D geometric description composed of triangulated meshes and sur- 
face materials. Sources of light and sound are specified with their 3D positions and 
radiative properties, including their directivity and the energy emitted within the 
perceivable frequency range. Given this information as input, special algorithms 
produce dynamic audio-visual signals that are displayed to the user via screens and 
speaker arrays or stereoscopic head-mounted displays and near-to-ear speakers or 
headphones. This is the overall process of rendering, whose two components are 
visualization and auralization (or visual- and audio-rendering). 

Rendering has been a central problem in both the graphics and audio communities 
for decades. While the initial thrust for graphics came from computer-aided design 
applications, within audio, room acoustic auralization of planned auditoria and con- 
cert halls was a central driving force. The technical challenge with rendering is that 
modeling propagation in complex worlds is immensely compute-intensive. A naive 
implementation of classical physical laws governing optics and acoustics is found 
to be many orders of magnitude slower than required (elaborated in Sect. 3.2.1). 
Furthermore, the exponential increase in compute power governed by Moore’s law 
has begun to stall in the last decade due to fundamental physical limits [97]. These 
two facts together mean that modeling propagation quickly enough for practical use 
requires research into specialized system architectures and simulation algorithms. 


Perception and Interactivity. A common theme in rendering research is that quanti- 
tative accuracy as required in engineering applications is not the primary goal. Rather, 
perception plays the central role: one must find ways to compute those aspects of 
physical phenomena that inform our sensory system. Consequently, initial graphics 
research in the 1970s focused on visible-surface determination [54] to convey spa- 
tial relations and object silhouettes, while initial room acoustics research focused 
on reverberation time [60] to convey presence in a room and indicate its size. With 
that foundation, subsequent research has been devoted toward increasing the amount 
of detail to reach “perceptually authentic” audio-visual rendering: one that is indis- 
tinguishable from an audio-visual capture of a real scene. Research has focused on 
the coupled problems of increasing our knowledge of psycho-physics, and designing 
fast techniques that leverage this knowledge to reduce computation while providing 
the means to test new psycho-physical hypotheses. 

The interactivity of virtual reality and games adds an additional dimension of 
difficulty. In linear media such as movies, the sequence of events is fixed, and com- 
putation times of hours or days for pre-rendered digital content can be acceptable, 
with human assistance provided as necessary. However, interactive applications can- 
not be pre-rendered in this way, as the user actions are not known in advance. Instead, 
the computer must perform real-time rendering: as events unfold based on user input, 
the system must model how the scene would look and sound from moment to moment 
as the user moves and interacts with the virtual world. It must do so with minimal
latency of about 10–100 ms, depending on the application. Audio introduces the
additional challenge of a hard real-time deadline. While a visual frame rendered 
slightly late is not ideal but perhaps acceptable, audio lags may result in silent gaps 
in the output. Such signal discontinuities annoy the user and break immersion and 
presence. Therefore, auralization systems in VR tend to prioritize computational 
efficiency and perceptual plausibility while building toward perceptual authenticity 
from that starting point. 


Goal. The purpose of this chapter is to present the fundamental concepts and design 
principles of modern real-time auralization systems, with an emphasis on recent 
developments in virtual reality and gaming applications. We do not aim for an exhaus- 
tive treatment of the theory and methods in the field. For such a treatment, we refer 
the reader to Vorländer’s treatise on the subject [102].


Organization. We begin by outlining the computational challenges and the result- 
ing architectural design choices of real-time auralization systems in Sect. 3.2. This 
architecture is then formalized via the Bidirectional Impulse Response (BIR), Head- 
Related Transfer Functions (HRTFs), and rendering equation in Sect. 3.3. In Sect. 3.4, 
we summarize relevant psycho-acoustic phenomena in complex VR scenes and elab- 
orate on how one must balance a believable rendering with real-time constraints 
among other system design factors in Sect. 3.5. We then discuss in Sect. 3.6 how the 
formalism, perception, and design constraints come together into the deterministic- 
statistical decomposition of the BIR, a powerful idea employed by most auralization 
systems. Section 3.7 provides a brief overview of the two common approaches to 
acoustical simulation: geometric and wave-based methods. In Sect. 3.8, we discuss 
some example systems in use today in more depth, to illustrate how they balance the 
various constraints informing their design decisions, followed by the conclusion in 
Sect. 3.9. 


3.2 Architecture of Real-time Auralization Systems 


In this section, we discuss the specific physical aspects of sound that make it compu- 
tationally difficult to model, which motivates a modular, efficient system architecture. 


3.2.1 Computational Cost 


To understand the specific modeling concerns of auralization, it helps to juxtapose it
with light simulation in games and VR applications. In particular:


• Speed: The propagation speed of sound is low enough that we perceive its various
transient aspects such as initial reflections and reverberation, which carry distinct
perceptible information, while light propagation can be treated as instantaneous;

• Phase: Everyday sounds are often coherent or harmonic signals whose phase
must be treated carefully throughout the auralization pipeline to avoid audible
distortions such as signal discontinuities, whereas natural light sources tend to be
incoherent;

• Wavelength: Audible sound wavelengths are comparable to the size of architectural
and human features (cm to m), which makes wave diffraction ubiquitous.
Unlike visuals, audible sound is not limited by line of sight.


Given the unique characteristics of sound propagation outlined above, auralization 
must begin with a fundamental treatment of sound as a transient, coherent wave 
phenomenon, while lighting can assume a much simpler geometric formulation of ray 
propagation for computing a stochastic, steady-state solution [57]. Auralization must 
carefully approximate the relevant physical mechanisms underlying the vibration of 
objects, propagation in air, and scattering by the listener’s body. All these mechanisms 
require modeling highly oscillatory wave fields that must be sufficiently sampled in 
space and time, giving rise to the tremendous computational expense of brute-force 
simulation. 

Assume some physical domain of interest with diameter $D$, the highest frequency
of interest $\nu_{\max}$, and speed of propagation $c$. The smallest propagating wavelength
of interest is $c/\nu_{\max}$. Thus, the total degrees of freedom in the space-time volume
of interest are $N_{\mathrm{dof}} = (2 D \nu_{\max}/c)^4$. The factor of two is due to the Nyquist limit
which enforces two degrees of freedom per oscillation. As an example, for full
audible bandwidth simulation of sound propagation up to $\nu_{\max} = 20{,}000$ Hz in a
scene that is $D = 100$ m across, with $c = 340$ m/s in air: $N_{\mathrm{dof}} = 1.9 \times 10^{16}$. For
an update interval of 60 ms to meet latency requirements for interactive listener 
head orientation updates [22], one would thus need a computational rate of over 100 
PetaFLOPS. By comparison, a typical game or VR application will allocate a single 
CPU core for audio with a computational rate in the range of tens of GigaFLOPS, 
which is too slow by a factor of at least one million. This gap motivates research in 
the area. 
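
The arithmetic behind this estimate is easy to reproduce. The sketch below assumes, for illustration only, one floating-point operation per degree of freedom per update and a hypothetical audio budget of 50 GigaFLOPS (the text says only "tens of GigaFLOPS"); the function name is likewise an assumption.

```python
def degrees_of_freedom(diameter_m, f_max_hz, c=340.0):
    """(2 * D * f_max / c) ** 4: Nyquist sampling of the space-time volume."""
    return (2.0 * diameter_m * f_max_hz / c) ** 4


n_dof = degrees_of_freedom(100.0, 20000.0)   # about 1.9e16

# Assumption: one operation per degree of freedom, redone every 60 ms.
flops_needed = n_dof / 0.060                 # about 3.2e17, i.e. over 100 PetaFLOPS

# Assumption: a single CPU core delivering 50 GigaFLOPS for audio.
gap = flops_needed / 50e9                    # more than one million
```

Because the count scales with the fourth power of bandwidth and scene size, halving the maximum frequency cuts the estimate by a factor of 16, which is one reason band-limited simulation is attractive in practice.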


3.2.2 Modular Design 


Since pioneering work in the 1990s such as DIVA [86, 96], most real-time auraliza- 
tion systems follow a modular architecture shown in Fig. 3.1. This architecture results 
in a flexible implementation and significant reduction of computational complexity, 
without substantially impacting simulation accuracy in cases of practical interest. 

Rather than simulating the global scene as a single system which might be pro- 
hibitively expensive (see Sect. 3.2.1), the problem is divided into three components 
in a causal chain without feedback: 


• Production: Sound is first produced at the source due to vibration, which, combined
with local self-scattering, results in a direction-dependent radiated source
signal;




[Figure: sound signal, geometric model & materials → Propagation → directional sound field → Spatialization → headphones / speakers]

Fig. 3.1 Modular architecture of real-time auralization systems. The propagation of sound emitted 
from each source is simulated within the 3D environment to compute a directional sound field 
immersing the listener. This field is given to the spatializer component that computes appropriate 
transducer signals for headphone or speaker playback 


• Propagation: The radiated sound diffracts, scatters, and reflects in the scene to
result in a direction-dependent sound field at the listener location;

• Spatialization: The sound field is heard by the listener. The spatialization component
computes transducer signals for playback, taking the listener’s head orientation
into account. In the case of using headphones, this implies accounting for
scattering due to the listener’s head and shoulders, as described by the head-related
transfer function (HRTF).


Our focus in this chapter will be on the latter two components; sound production tech- 
niques such as physical-modeling synthesis are covered in Chap. 2. Here, we assume 
a source modeled as a (monophonic) radiated signal combined with a direction- 
dependent radiation pattern. 

This separation of the auralization problem into different components is key for 
efficient computation. Firstly, the perceptual characteristics of all three components 
may be studied separately and then approximated with tailored numerical meth- 
ods. Secondly, since the final rendering is composed of these separate models, they 
can be flexibly modified at runtime. For instance, a source’s sound and directivity 
pattern may be updated, or the listener orientation may change, without expensive 
re-computation of global sound propagation. Section 3.3 will formalize this idea. 


Limitations. This architecture is not a good fit for cases with strong near-field inter- 
action. For instance, if the listener’s head is close to a wall, there can be non-negligible 
multiple scattering, so the feedback between propagation and spatialization cannot 
be ignored. This can be an important scenario in VR [69]. Similarly, if one plays a 
trumpet with its bell very close to a surface, the resonant modes and radiated sound 
will be modified, much like placing a mute, which is a case where there is feedback 
between all three components outlined above. Thus, numerical simulations for musi- 
cal acoustics tend to be quite challenging. The interested reader can consult Bilbao’s 
text on the subject [12] and more recent overview [14]. In the computer graphics
community, the work in [104] also shows sound production and propagation modeled
directly without the separability assumption, with special emphasis on handling
dynamic geometry, for application in computer animation. Such simulations tend to 
be off-line, but modern graphics cards have become fast enough for approximate 
modeling of interactive 2D wind instruments in real-time [6]. 


3.2.3 Propagation 


The propagation component takes the locations of a source and listener in the scene 
to predict the scene’s acoustic response, modeling salient effects such as diffracted 
occlusion, initial reflections, and reverberation. Combined with source sounds and 
radiation patterns, it outputs a directional sound field to the listener. Propagation is 
usually the most compute-intensive portion of an auralization pipeline, motivating 
many techniques and systems, which we will discuss in Sects. 3.7 and 3.8. The 
methods have two assumptions in common. 


Linearity. For most auralization applications, it is safe to assume that sound ampli- 
tudes remain low enough to obey ideal linear propagation, modeled by the scalar 
wave equation. As a result, the sound field at the listener location is a linear sum- 
mation of contributions from all sound sources. There are some cases in games and 
VR when the assumption of linearity may be violated, for instance with explosions 
or brass instruments. In most such cases, the non-linear behavior is restricted to the 
vicinity of the event and may be treated via a first-order perturbative approximation 
which amounts to linear propagation with a locally varying sound speed [4, 27]. 


Quasi-static scene configuration. Interactive scenes are dynamic, but most prop- 
agation methods assume that the problem may be treated as quasi-static. At some 
fixed update rate, such as a visual frame, they take a static snapshot of the scene 
shape as well as the locations of the source and listener within it. Then propagation 
is modeled assuming a linear, time-invariant system for the duration of the visual 
frame. The computed response for each sound source is smoothly interpolated over 
frames to ensure a dynamic rendering free of artifacts to the listener. 
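
One simple way to realize such interpolation is to filter each audio block with both the previous and the current response and crossfade between the two outputs. The sketch below assumes a linear fade and, for brevity, discards the convolution tails that a real overlap-add renderer would carry into the next block; the function names are hypothetical and no particular system is implied.

```python
def convolve(signal, ir):
    """Direct-form convolution of two short sequences."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def crossfade_block(block, ir_prev, ir_next):
    """Filter one block with both responses and fade linearly from old to new.

    Requires at least two samples per block. Tails beyond the block length
    are dropped here, which is a simplification.
    """
    n = len(block)
    y_prev = convolve(block, ir_prev)[:n]
    y_next = convolve(block, ir_next)[:n]
    return [(1 - k / (n - 1)) * a + (k / (n - 1)) * b
            for k, (a, b) in enumerate(zip(y_prev, y_next))]


# Toy case: the response changes from unit gain to silence between two frames.
out = crossfade_block([1.0] * 5, ir_prev=[1.0], ir_next=[0.0])
```

The fade avoids the click that an abrupt switch of filters would produce, at the cost of running two convolutions during the transition.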

Fast-moving sources need to be treated with additional care as direct interpolation 
of acoustic responses can become error-prone [80]. An important related aspect is 
the Doppler Shift on first arrival, a salient, audible effect. It may be approximated 
in the source model by modifying the radiated signal based on source and listener 
velocities, or by interpolating the propagation delay of the initial sound. Another 
case violating the quasi-static assumption is that of aero-acoustic sounds radiated from
fast object motion through the air. These can be approximated within the source 
model with Lighthill’s acoustic analogy [53], with subsequent linear propagation for 
real-time rendering [30, 31]. 
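
For the first of the two Doppler options mentioned above (modifying the radiated signal), a minimal sketch is the classical frequency-scaling factor for a medium at rest. The velocities are assumed to be the components along the line joining source and listener, positive when moving toward the other party; the function and parameter names are hypothetical.

```python
def doppler_factor(v_source_toward, v_listener_toward, c=340.0):
    """Ratio of received to emitted frequency for subsonic motion in still air."""
    return (c + v_listener_toward) / (c - v_source_toward)


# A source approaching a static listener at 34 m/s raises the pitch by about 11%.
approaching = doppler_factor(34.0, 0.0)
# The same source receding lowers it.
receding = doppler_factor(-34.0, 0.0)
```

A renderer could apply this factor as a resampling ratio on the source signal. Interpolating the propagation delay instead yields the shift implicitly, since a time-varying delay line stretches or compresses the signal.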
3.2.4 Spatialization


In a virtual reality scenario, the target of the audio rendering engine is typically a 
listener located within the virtual scene experiencing the virtual acoustic environment 
with both ears. For this experience to feel plausible or natural, sound should be 
rendered to the user’s ears as if they were actually present in the virtual scene. The 
architecture in Fig. 3.1 neglects the effect of the listener on global sound propagation. 
The spatialization system (shown to the right in the figure) inserts the listener virtually 
into the scene and requires additional processing. A properly spatialized virtual sound 
source should be perceived by the listener as emanating from a given location. In the 
simplest case of free-field propagation, a sound source can be positioned virtually by 
convolving the source signal with a pair of filters (also known as head-related transfer 
functions (HRTFs)). This results in two ear input signals that can be presented directly 
to the listener over headphones. For a more complex virtual scene containing multiple 
sound sources as well as their acoustic interactions with the virtual environment, 
spatialization entails encoding appropriate localization cues to the sound field at 
the listener’s ear entrances. Common approaches include spherical-harmonics based 
rendering (“Ambisonics”) [42, 67] as well as object-based rendering [17].


HRTFs. If the sound is played back to the listener via headphones, this implies 
simulating the filtering that sound undergoes in a real sound field as it enters the 
ear entrances, due to reflections and scattering from the listener’s torso, head, and 
pinnae. A convenient way to describe this filtering behavior is via the HRTFs. The 
HRTFs are a function of the direction of arrival and contain the localization cues 
that the human auditory system decodes to determine the direction of an incoming 
wavefront. HRTFs for a particular listener are usually constructed via measurements 
in an anechoic chamber [40], though recent efforts exist to derive HRTFs for a listener 
on the fly without an anechoic chamber [50, 61], by adapting or personalizing existing 
HRTF databases using anthropometric features [15, 38, 41, 89, 106], or by capturing 
image or depth data to model the HRTFs numerically [20, 58, 65]. For a review of 
HRTF personalization techniques, refer to Chap. 4 and see [48]. The HRTFs can be 
tabulated as two spherical functions $H^{L/R}(s, t)$ that encapsulate the angle-dependent
acoustic transfer in the free field to the left and right ears. The set of incident angles 
s contained in the HRTF dataset is typically dictated by the HRTF measurement 
setup [5, 39]. The process of applying HRTFs to a virtual source signal to encode 
localization cues is referred to as binaural spatialization. 
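
In the free-field case, binaural spatialization reduces to two convolutions. The sketch below uses toy impulse responses (a pure delay and attenuation at the far ear) as stand-ins for measured head-related impulse responses; they are assumptions for illustration and carry none of the spectral cues of real HRTFs.

```python
def convolve(signal, ir):
    """Direct-form convolution of two short sequences."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def binauralize(mono, hrir_left, hrir_right):
    """Apply one filter per ear to a monophonic source signal."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy filters for a source on the listener's left: the right ear receives
# a copy that is delayed by three samples and attenuated.
hrir_left = [1.0]
hrir_right = [0.0, 0.0, 0.0, 0.6]
left, right = binauralize([1.0, 0.5], hrir_left, hrir_right)
```

In a real system the filter pair would be looked up (and typically interpolated) from a tabulated HRTF set for the current direction of arrival relative to the listener's head.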

Spatialization for loudspeaker arrays is also possible, commonly performed using 
channel-based methods such as Vector Base Amplitude Panning [72] or Ambison- 
ics [42]. It is also possible to physically reproduce the virtual directional sound field 
using Wave Field Synthesis [2] with large loudspeaker arrays. For the rest of this 
chapter, we will focus on binaural spatialization, although most of the discussion can 
be easily adapted to loudspeaker reproduction as discussed in Chap. 5. 


Spherical-harmonics based rendering. Various methods exist to spatialize acoustic 
scenes. A convenient description of directional fields is via spherical harmonics 
(SHs) or Ambisonics [43]. Given an SH representation of a scene, binaural ear input
signals can be obtained directly via filtering with an SH representation of the listener’s
HRTFs [29]. However, encoding complex acoustic scenes to SHs of sufficiently high 
order while minimizing audible artifacts can be challenging [10, 11, 19, 51]. The 
openly available Resonance Audio [47] system follows this approach. 


Object-based rendering. In this chapter, we will follow the direct parameterization 
over time and angle of arrival which is also common in practice, such as the illustrative 
auralization system we discuss in Sect. 3.8.4. The system directly outputs signals 
and directions, suitable for spatialization by applying appropriate HRTF pairs. The 
description of the acoustic propagation problem from a source to the listener in terms 
of a directional sound field as presented in Sect. 3.3.4 results in a convenient interface 
between the propagation model and the spatialization engine. 

This provides three major advantages. Firstly, it enables a modular system design 
that treats propagation modeling and (real-time) spatialization as separate problems 
that are solved by independent sub-systems. This separation in turn allows improving 
and optimizing the sub-systems individually and can lead to significant computa- 
tional cost savings. Secondly, a description of a sound field enveloping the listener 
in terms of time and angle of arrival is equivalent to an object-based representa- 
tion, which is a well-established input format for existing spatialization software, 
thus allowing the system designer to build easily on existing spatialization systems. 
Finally, psycho-acoustic findings on perceptual limits of human spatial hearing, such
as just-noticeable differences, are expressed as a function of time and angle of arrival
(Sect. 3.4). Knowledge of these perceptual limits can be exploited for further com- 
putational savings. 


3.3 Mathematical Model 


Auralization may be formalized as a linear, time-invariant process as follows. Assume 
a quasi-static state of the world at the current visual frame. To auralize a sound source, 
consider its current pose (position and orientation) to determine its directional sound 
radiation and then model propagation and spatialization as a feed-forward chain of 
linear filters. Those filters in turn depend on the current world shape and listener 
pose, respectively. 


Notation. For the remainder of this chapter, for any quantity ($\star$) referring to the
listener, we use prime ($\star'$) to denote a corresponding quantity referring to the source.
In particular, $x$ is listener location and $x'$ source location. Temporal convolution is
denoted by $*$.


3.3.1 The Green’s Function


With the linearity and time-invariance assumptions, along with the absence of mean 
flow or wind, the Navier-Stokes equations simplify to the scalar wave equation that 
models propagating longitudinal pressure deviations from quiescent atmospheric 
pressure [70]: 


\[ \left[\frac{1}{c^2}\,\partial_t^2 - \nabla^2\right] p(t, x, x') = \delta(t)\,\delta(x - x'), \tag{3.1} \]


where c = 340 m/s is the speed of sound and $\nabla^2$ is the 3D Laplacian operator ranging over
x. The solution is performed on some 3D domain provided by the scene’s shape, with 
appropriate boundary conditions to model the frequency-dependent absorptivity of 
physical materials. 

Sound propagation is induced by a pulsed excitation at time t = 0 and source
location x’, with $\delta(\cdot)$ denoting the Dirac delta function. The solution p(t, x, x’) is
Green’s function that fully describes the scene’s global wave transport, including
diffraction and scattering. The principle of acoustic reciprocity ensures that source 
and listener positions are interchangeable [70]: 


p(t, x, x’) = p(t, x’, x). (3.2) 


For treating general scenes, a numerical solver must be employed to discretely sample 
Green’s function in space and time. This includes accurate wave-based methods that 
directly solve for the time-evolving field on a grid, or fast geometric methods that 
employ the high-frequency Eikonal approximation. We will discuss solution methods 
in Sect. 3.7. 
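
As a minimal illustration of discretely sampling Green’s function and of the reciprocity relation (3.2), the sketch below steps a one-dimensional analogue of (3.1) on a uniform grid. It is a toy under stated assumptions (one dimension, a time step equal to the grid spacing divided by c, pressure held at zero just outside both ends) and is not one of the solvers discussed in this chapter; the function name and grid sizes are hypothetical.

```python
import numpy as np

def impulse_response_1d(n_cells, src, rcv, n_steps):
    """Sample a 1D discrete Green's function at cell `rcv` for a unit pulse at `src`.

    Assumptions (illustrative only): uniform grid, time step dt = dx / c,
    and pressure fixed to zero just outside both ends of the grid.
    """
    prev = np.zeros(n_cells)   # field at step n - 1
    curr = np.zeros(n_cells)   # field at step n
    curr[src] = 1.0            # unit pulse injected at the source cell
    response = np.zeros(n_steps)
    for n in range(n_steps):
        response[n] = curr[rcv]
        # Leapfrog update: p[n+1, i] = p[n, i-1] + p[n, i+1] - p[n-1, i]
        nxt = -prev
        nxt[1:] += curr[:-1]
        nxt[:-1] += curr[1:]
        prev, curr = curr, nxt
    return response

forward = impulse_response_1d(16, src=3, rcv=11, n_steps=64)
reverse = impulse_response_1d(16, src=11, rcv=3, n_steps=64)
print(np.array_equal(forward, reverse))   # swapping source and listener
print(int(np.argmax(forward != 0)))       # first arrival, in time steps
```

The two sampled responses agree sample by sample even though the source and listener sit at different distances from the two ends, and the first arrival occurs after as many time steps as there are cells between the two positions.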

In principle, Green’s function has complete information [3], including direction- 
ality, which can be extracted via spatio-temporal convolution of p(t, x, x’) with 
volumetric source and listener distributions that can model arbitrary radiation pat- 
terns [13] and listener directivity [91]. But such an approach is too expensive for 
real-time evaluation on large scenes, requiring temporal convolution and spatial 
quadrature over sub-wavelength grids that need to be repeated when either the source 
or listener moves. Geometric techniques cannot follow such an approach at all, as 
they do not model wave phase. 

This is where modularity (Sect. 3.2.2) becomes indispensable: the source and 
listener are not directly included within the propagation simulation, but are instead 
incorporated via tabulated directivity functions that result from their local radiation 
and scattering characteristics. Below, we formulate the propagation component of 
this modular approach, beginning with the simplest case of an isotropic source and 
listener, building up to a fully bidirectional representation that can be combined with 
arbitrary source and listener directivity during rendering. 


3.3.2 Impulse Response


Consider an isotropic (omni-directional) sound source located at x’ that is emitting 
a coherent pressure signal q'(t). The resulting pressure signal at listener location x 
can be computed using a temporal convolution: 


q(t; x, x’) = q' (t) * p(t; x, x’). (3.3) 


Here, p(t; x, x’) is obtained by evaluating Green’s function between the listener and
source locations (x, x’). We denote this evaluation by placing the locations after a
semicolon, as in p(t; x, x’), to signify that they are held constant, yielding a function
of time alone. This function is the (monoaural) impulse response capturing the various
acoustic path
delays and amplitudes from the source to the listener via the scene. The vibrational 
aspects of how the source event generated the sound q’(t) are abstracted away— 
it may be synthesized at runtime, or read out from a pre-recorded file and freely 
substituted. 
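
A discrete version of (3.3) can be sketched as follows. The impulse response is built from two hypothetical paths (a direct path and one reflection), with frequency-independent gains; the sample rate, path lengths, gains, and helper name are assumptions chosen for illustration only.

```python
import numpy as np

fs = 48_000                       # sample rate in Hz (assumed)
c = 340.0                         # speed of sound in m/s, as in (3.1)

def toy_impulse_response(path_lengths_m, gains, fs=fs, c=c):
    """Build p(t; x, x') as scaled unit spikes, one per acoustic path."""
    delays = np.round(np.asarray(path_lengths_m) / c * fs).astype(int)
    p = np.zeros(delays.max() + 1)
    np.add.at(p, delays, gains)   # accumulate in case two paths share a sample
    return p

# Hypothetical scene: a 3.4 m direct path and a 6.8 m reflected path.
p = toy_impulse_response([3.4, 6.8], gains=[1.0, 0.5])
q_src = np.array([1.0, -1.0, 0.5])    # emitted signal q'(t)
q_lis = np.convolve(q_src, p)         # received signal q(t; x, x'), Eq. (3.3)
```

The received signal contains one delayed, scaled copy of the emitted signal per path, which is exactly what the convolution expresses; the source signal can be swapped freely without recomputing p.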


3.3.3 Directional Impulse Response 


The directional impulse response d(t, s; x, x’) [32] generalizes the impulse response
p(t; x, x’) to include the direction of arrival, s. Intuitively, it is the signal obtained by
the listener if they were to point an ideal directional microphone in direction s when
the source at x’ emits an isotropic impulse.

Given a directional impulse response, spatialization for the listener can be per- 
formed to reproduce the directional listening experience via 


\[ q^{\{L,R\}}(t; x, x') = q'(t) * \int_{S^2} d(t, s; x, x') * H^{\{L,R\}}\!\left(R^{-1}(s), t\right) ds, \tag{3.4} \]


where $H^{\{L,R\}}(s, t)$ are the left and right HRTFs of the listener as discussed in
Sect. 3.2.4, R is a rotation matrix mapping from head to world coordinate system,
and $s \in S^2$ represents the space of incident spherical directions forming the
integration domain. Note the advantage of separating propagation (directional impulse
response) from spatialization (HRTF application). The expensive simulation nec- 
essary for solving (3.1) can ignore the listener’s body entirely, which is inserted 
later taking its dynamic rotation R into account, via separately tabulated HRTFs as 
in (3.4). 

Fig. 3.2 Bidirectional impulse response (BIR). An impulse radiates from source position x’,
propagates through a scene, and arrives via two paths in this simple case at listener position x.
The paths radiate in directions $s'_1$ and $s'_2$ and arrive from directions $s_1$ and $s_2$,
respectively, with delays based on the respective path lengths. The bidirectional impulse
response (BIR), denoted by D(t, s, s’; x, x’), contains this time-dependent directional
information. Evaluating for specific radiant and incoming directions isolates arrivals,
as shown on the right. (figure adapted from [26])


3.3.4 Bidirectional Impulse Response (BIR) and Rendering 
Equation 


The above still leaves out direction-dependent radiation at the source. A complete
description of auralization for localized sound sources can be achieved by the natural
extension to the bidirectional impulse response (BIR) [26], an 11-dimensional function
of the wave field, D(t, s, s’; x, x’), illustrated in Fig. 3.2. Analogous to the HRTF,
the source’s radiation pattern is tabulated in a source directivity function (SDF),
S(s’, t), such that its radiated signal in any direction s’ is given by $q'(t) * S(s', t)$.
We can now write the (binaural) rendering equation:


\[ q^{\{L,R\}}(t; x, x') = q'(t) * \iint D(t, s, s'; x, x') * S\!\left(R'^{-1}(s'), t\right) * H^{\{L,R\}}\!\left(R^{-1}(s), t\right) ds\, ds', \tag{3.5} \]


where R is a rotation matrix mapping from the listener’s head to the world coordinate 
system, R’ maps rotation from the source to the world coordinate system, and the 
double integral varies over the space of both incident and emitted directions,
$s, s' \in S^2$. A similar formulation can be obtained for speaker-based rendering by using, for
instance, VBAP speaker panning weights [72] instead of HRTFs. 

The BIR is convolved with the source’s and listener’s free-field directional
responses S and $H^{\{L,R\}}$, respectively, while accounting for their rotation since (s, s’)
are in world coordinates, to capture modification due to directional radiation and
reception. The integral repeats this for all combinations of (s, s’), yielding the net
binaural response. This is finally convolved with the emitted signal q’(t) to obtain a
binaural output that should be delivered to the entrances of the listener’s ear canals.
If multiple sound sources are present, this process is repeated for each source
and the results are summed.


Bidirectional decomposition and reciprocity. The bidirectional impulse response 
generalizes the more restrictive notions of impulse response in (3.4) and (3.3), illus- 
trated in Fig. 3.2. The directional impulse response can be obtained by integrating 
over all radiating directions s’ and yields directional effects to the listener for an 
omnidirectional source: 


\[ d(t, s; x, x') = \int_{S^2} D(t, s, s'; x, x')\, ds'. \tag{3.6} \]


Similarly, a subsequent integration over directions to the listener, s, yields back the 
monoaural impulse response, p(t; x, x’). 

The BIR admits direct geometric interpretation. With source and listener located 
at (x’, x), respectively, consider any pair of radiated and arrival directions (s’, s). 
In general, multiple paths connect these pairs, (x’, s’) $\rightsquigarrow$ (x, s), with corresponding
delays and amplitudes, all of which are captured by D(t, s, s’; x, x’). Figure 3.2
illustrates a simple case. The BIR is thus a fully reciprocal description of sound
propagation within an arbitrary scene. Interchanging source and listener, all propagation
paths reverse: 


D(t, s,s’; x, x’) = D(t, s', s; x', x). (3.7) 


This reciprocal symmetry mirrors that for the underlying wave field, p(t; x, x’) = 
p(t; x’, x) and requires a full bidirectional description. In particular, the directional 
impulse response is non-reciprocal. 
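
The marginalization in (3.6) and the symmetry (3.7) can be illustrated on a coarsely discretized BIR. In the sketch below, directions are binned into a handful of hypothetical direction cells, and each entry of the array is assumed to hold the amplitude already integrated over its cell, so the spherical integrals reduce to plain sums. The array sizes and the two paths are made up for illustration.

```python
import numpy as np

n_t, n_s = 8, 4                 # time samples and direction bins (hypothetical sizes)
D = np.zeros((n_t, n_s, n_s))   # D[t, s, s'] for a fixed pair (x, x')

# Two paths as in Fig. 3.2: (delay, arrival bin s, radiation bin s', amplitude)
for delay, s_arr, s_src, amp in [(2, 0, 1, 1.0), (5, 3, 2, 0.4)]:
    D[delay, s_arr, s_src] += amp

d = D.sum(axis=2)   # Eq. (3.6): integrate over radiated directions s'
p = d.sum(axis=1)   # integrate over arrival directions s: monoaural response

# Eq. (3.7): interchanging source and listener swaps the roles of s and s'.
D_swapped = D.transpose(0, 2, 1)
d_swapped = D_swapped.sum(axis=2)
```

Here `d` and `d_swapped` differ, because the arrival directions seen from the two end points are different, while the monoaural response `p` is the same for both, in line with (3.2).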


3.3.5 Band-limitation and the Diffraction Limit


It is important to remember that the bidirectional impulse response is a mathemati- 
cally convenient intermediate representation only, and cannot be realized physically. 
The only physically observed quantity is the final rendered audio, $q^{\{L,R\}}(t; x, x')$.
In particular, the BIR representation allows unlimited resolution in time and direction.
The source signal, q’(t), is temporally band-limited for typical sounds, due to
aggressive absorption in solid media and air as frequency increases. Similarly, audi- 
tory perception is limited to 20 kHz. Band-limitation holds for directional resolution 
as well because of the diffraction limit [16] which places a fundamental restriction 
on the angular resolution achievable with a spatially finite radiator or receiver. 
For a propagating wavelength $\lambda$, the diffraction-limited angular resolution scales
as $D/\lambda$, where D is the diameter of an enclosing sphere, such as around a radiating
object, or the listener’s head and shoulders in the case of HRTFs [105]. Therefore, all 
the convolutions and spherical quadratures in (3.5) may be performed on a discretiza- 
tion with sufficient sub-wavelength resolution at the highest frequency of interest. 
Alternatively, it is common to perform time convolutions in frequency domain via 
the Fast Fourier Transform (FFT) for efficiency. Similarly, spherical harmonics (SH)
form an orthonormal linear basis over the sphere, so the spherical quadrature of a
product of two functions reduces to an inner product of their SH coefficients.
An end-to-end auralization system using this approach was shown
in [63]. 
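
The equivalence between time-domain convolution and multiplication in the frequency domain can be checked with a short sketch. The helper name and signal lengths below are hypothetical, and the zero-padding to a power of two is a common convenience rather than a requirement.

```python
import numpy as np

def fft_convolve(signal, impulse_response):
    """Linear convolution computed as a product of zero-padded FFT spectra."""
    n_out = len(signal) + len(impulse_response) - 1
    n_fft = 1 << (n_out - 1).bit_length()   # next power of two >= n_out
    spectrum = np.fft.rfft(signal, n_fft) * np.fft.rfft(impulse_response, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:n_out]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)               # stand-in for a source signal
h = rng.standard_normal(256)                # stand-in for a filter
max_error = np.max(np.abs(fft_convolve(x, h) - np.convolve(x, h)))
```

The two results agree to within floating-point rounding. The padding to at least the full output length matters: without it, the product of spectra would implement a circular rather than a linear convolution.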


3.4 Structure and Perception of the Bidirectional Impulse 
Response (BIR) 


To explain how the theory outlined above can be put into practice, we will first review 
the physical and perceptual structure of the BIR, followed by a discussion of how 
auralization systems approximate it in various ways.


3.4.1 Physical Structure 


The structure of a typical (bidirectional) impulse response may be understood in three 
phases in time, as illustrated in Fig. 3.3. First, the emitted sound must propagate via the 
shortest path, potentially diffracting around obstruction edges to reach the listener 
after some onset delay. This is the initial (or “direct”) sound. The initial sound is 
followed by early reflections due to scattering and reflection from scene geometry. 
As sound continues to scatter multiple times from the scene, the temporal arrival
density of reflections increases, while the energy of an individual arrival decreases
due to absorption at material boundaries and in the air. Over time, with sufficient
scattering, the response approaches decaying Gaussian noise, which is referred to
as late reverberation. The transition from early reflections to late reverberation is
demarcated by the mixing time [1, 98].

Fig. 3.3 Structure of the bidirectional impulse response (figure adapted from [26])

As we discuss next, each of these phases has a distinct contribution to the overall 
spatial perception of a sound. These properties of human auditory perception
play a key role in informing how one might approximate the rendering equation (3.5) 
within limited computational resources, while still retaining an immersive auditory 
experience. A more detailed review of perception of room acoustics can be found 
in [37] and [60]. All observations and terms below can be found in these references, 
unless otherwise noted. 


3.4.2 Initial (“Direct”) Sound 


Our perception strongly relies on the initial sound to localize sound sources, a phe- 
nomenon called the precedence effect [62]. Referring to Fig. 3.3, if there is a sec- 
ondary arrival that is roughly within 1 ms of the initial sound, we perceive a direction 
intermediate between the two arrival directions, termed summing localization,
representing the temporal resolution of spatial hearing. Beyond this 1 ms time window,
our perceptual system exerts a strongly non-linear suppression effect, so people do 
not confuse the direction of strong reflections with the true heading of the sound. 
Sometimes called the Haas effect, a later arrival may need to be as much as 10 dB 
louder than the initial sound to affect the perceived direction significantly. Note that 
this is not to say that the later arrival is not perceived at all, only that its effect is not 
to substantially change the localized direction. 

Consider the case shown in Fig. 3.3, and assume the walls do not substantially 
transmit sound. The sound shown inside the room would be localized by the listener 
outside as arriving from the direction of the doorway, rather than the line of sight. 
Such cues are a natural part of how we navigate to visually occluded events in 
everyday life. The upshot is that in virtual reality, the initial sound path may be 
multiply-diffracted and must be modeled with particular care so that the user gets 
localization cues consistent with the virtual world. 


3.4.3 Early Reflections 


Early reflections directly affect the perception of source properties such as loud- 
ness, width, and distance while also informing the listener about surrounding scene 
geometry such as nearby reflectors. A copy of a sound following the initial arrival
is perceptually fused up until a delay called the echo threshold, beyond which it is
perceived as a separate auditory event. The echo threshold ranges from 10 ms
for impulsive sounds, through 50 ms for speech, to 80 ms for orchestral music [62,
Table 1].

The impact of the loudness of early reflections is important in two ways. Firstly, 
the perception of source distance is known to correlate with the energy ratio between 
initial sound and remaining response (whose energy mostly comes from early reflec- 
tions), called the direct-to-reverberant ratio (DRR) [92]. This is often also called the 
“wet ratio” by audio designers. Secondly, how well one can understand and localize 
sounds depends on the ratio of the energy of direct sound and early reflection in the 
first 50 ms to the rest of the response, as measured by clarity (C50). 
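
Both quantities are energy ratios of an early part of the response to the remainder and can be sketched with one helper. The example below assumes that the response starts at the onset of the initial sound and, for the DRR, that the initial sound is everything within the first 3 ms; that window length, the low sample rate, and the three-spike response are illustrative assumptions, not values taken from the references.

```python
import numpy as np

def energy_ratio_db(h, fs, split_s):
    """10*log10 of the energy before `split_s` seconds over the energy after it."""
    k = int(round(split_s * fs))
    early, late = np.sum(h[:k] ** 2), np.sum(h[k:] ** 2)
    return 10.0 * np.log10(early / late)

fs = 1000                                # low rate keeps the toy example small
h = np.zeros(200)
h[0], h[20], h[100] = 1.0, 0.5, 0.5      # initial sound, early and late arrivals

drr_db = energy_ratio_db(h, fs, split_s=0.003)   # assumed 3 ms direct-sound window
c50_db = energy_ratio_db(h, fs, split_s=0.050)   # clarity, split at 50 ms
```

For this response the DRR is 10 log10(1 / 0.5), about 3 dB, and C50 is 10 log10(1.25 / 0.25), about 7 dB: moving the reflection at 20 ms into the "early" part raises the ratio.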

The directional distribution of reflections conveys important detail about the size 
and shape of the local environment around the listener and source. The ratio of 
reflected energy arriving horizontally and perpendicular to the initial sound is called 
lateral energy fraction and contributes to the perception of spaciousness and affects 
the apparent source width. Further, in VR, strong individual reflections from surfaces 
close to the listener provide an important proximity cue [69]. 

Thus, an auralization system must model strong initial reflections as well as the 
aggregate energy and directionality of later reflections up to the first 80 ms to ensure 
important cues about the sound source and environment are conveyed. 


3.4.4 Late Reverberation 


The reverberation time, $T_{60}$, is the time taken by the reverberant energy to decay
by 60 dB. Since the reverberation contains numerous, lengthy paths through the
scene, it provides a sense of the overall scene, such as its size. The $T_{60}$ is
frequency-dependent; the relative decay rate across various frequencies informs the listener
about the acoustic materials in a scene and atmospheric absorption. 
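
One common way to estimate the reverberation time from a sampled response is to fit a line to the backward-integrated energy decay curve. This estimation procedure is not prescribed by the text above; the sketch below uses it only to make the 60 dB definition concrete. The fit range of -5 to -25 dB, the broadband (single-band) treatment, and the synthetic response are assumptions.

```python
import numpy as np

def estimate_t60(h, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate T60 from a line fit to the backward-integrated energy decay curve."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]          # energy remaining after each sample
    edc_db = 10.0 * np.log10(edc / edc[0])
    hi, lo = fit_range_db
    idx = np.flatnonzero((edc_db <= hi) & (edc_db >= lo))
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # decay rate in dB per second
    return -60.0 / slope

fs, t60_true = 16_000, 0.5
t = np.arange(fs) / fs                           # one second of response
envelope = 10.0 ** (-3.0 * t / t60_true)         # amplitude falls 60 dB per t60_true
rng = np.random.default_rng(0)
h = rng.standard_normal(fs) * envelope           # decaying Gaussian noise
t60_est = estimate_t60(h, fs)
```

Because the tail is random, the estimate scatters slightly around the true value. Repeating the fit per frequency band, after band-pass filtering the response, would yield the frequency-dependent decay discussed above.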

The aggregate directional properties of reverberation affect listener envelopment 
which is the perception of being present in a room and immersed in its reverberant 
field (see Chap. 11 and Sect. 11.4.3 for further discussions on related topics). In 
virtual reality, one may often be present outside a room containing sounds and any 
implausible envelopment becomes especially distracting. For instance, consider the 
situation in Fig. 3.3—rendering an enveloping room reverberation for the listener 
will sound wrong, since the expectation would be low envelopment. 


3.5 System Design Considerations for VR Auralization 


Many types of real-time auralization systems exist today that approximate the
rendering equation (3.5). They differ in particular in how they evaluate the scene’s
sound propagation (i.e., the BIR, D(t, s, s’; x, x’)), which is typically the most
compute-intensive portion. They gain efficiency by making approximations based on the
intended application, with a knowledge of the limits of auditory perception.


3.5.1 Room Auralization 


The roots of auralization research lie in the area of computational modeling of room 
acoustics, an active area of research with developments dating back at least 50 
years [7, 60]. The main objective of these computer models has been to aid in the 
architectural design of enclosures, such as offices, classrooms, and concert halls. The 
predictions of these models can then be used by acousticians to propose architec- 
tural design changes or acoustic treatments to improve the reverberant properties of 
a particular room or hall, such as speech intelligibility in a classroom. This requires 
models that simulate the room’s first reflections and reverberation with perceptual 
authenticity. The direct path in such applications can often be computed analytically 
since the line of sight is rarely blocked. We direct the reader to Gade’s book chapter 
[37] on the subject of room acoustics for an excellent summary of the requirements, 
metrics, and methods in the field from the viewpoint of concert hall design. 

While initially the computer models could only produce quantitative estimates 
of room acoustic parameters, with increasing compute power, real-time auralization 
systems were proposed near the beginning of the millennium [86]. As we will discuss 
in more detail shortly, geometric methods are standard in the area today because they 
are especially well-suited for modeling a single enclosure where visual occlusion 
between sounds and listener is not dominant. This holds very well in any hall designed 
for speech or music. Room auralization is available today in commercial packages 
such as ODEON [82] and CATT [28]. 


3.5.2 VR Auralization 


The concerns of real-time VR auralization are quite distinct along a number of 
dimensions, which result from going from individual room to a scene that can span 
entire city blocks with numerous indoor and outdoor areas. This results in a unique 
set of considerations that we enumerate below, for two reasons. Firstly, they provide 
a framing for understanding current research in the area and the trade-offs current 
systems make, which we will discuss in the following sections. Secondly, we hope 
that the concise listing of practical problems motivates new research in the area, as 
no system today can meet all these criteria. 


1. Real time within limited computation. A VR application’s auralization com- 
ponent can usually only use a single or a few CPU cores for audio simulation 
at runtime, since resources must be shared with simulating other aspects of the 
world, such as rigid-body collisions, character animation, and AI path planning.
In contrast, owing to the application, in room acoustic auralization one can
consume a majority of the resources of a computer including the parallel com- 
pute power of modern graphics cards. With power-efficient mobile processors 
integrated into phones and standalone head-mounted displays, the pressure to 
minimize computation has only increased. 

2. Scene complexity and non line of sight. Room acoustics theory often starts by 
assuming a single connected space such as a concert hall that has lines of sight 
from the stage to all listener locations. This allows for a powerful simplification 
of the sound field as an analytically computable direct sound combined with a 
diffuse reverberant field. Modern VR systems for building and game acoustics 
consider the much broader class of all scenes such as a building floor with many 
rooms, or a street canyon with buildings that may be entered. These are complex 
scenes not just in the sense of surface detail but also in that the air volume 
is topologically complex, with many concavities. As a result, non line of sight 
cases are common. For instance, hearing sounds in the same room with plausible 
reverberation can be as important as not hearing sounds inside another room, or 
hearing sounds from unseen sources diffracted around a corner or door. 

3. Perception. Physical accuracy is important to VR auralization not as a goal in 
itself but rather in so far as it impacts sensory immersion. This opens opportu- 
nities for fast approximations, and deeply informs practical systems that scale 
their errors based on the acuity of the human auditory system. This observa- 
tion underlies the deterministic-statistical decomposition discussed in the next 
section. Further, in many applications such as games, plausibility can be suffi- 
cient as a starting point, while for instance in auralizing building acoustics one 
might need perceptual authenticity. 

4. Dynamic sounds. VR auralization must often support dynamic sound sources 
that can translate and rotate. The rendering must respond with low latency and 
without distracting artifacts, even for fast source motion. This adds significant 
complexity to a minimum-viable practical system. However, in architectural 
acoustic systems, static sound sources can be a feasible starting point. 

5. Dynamic geometry. In many applications, the scene geometry can be changed 
interactively. This may be while designing a virtual space, in which case an 
acoustical system for static scenes may re-compute on the updated geometry; 
depending on the system this can take seconds to hours. The more challenging 
case is when the geometry is changing in real time. The change might be “locally 
dynamic”, such as opening a door or moving an obstruction. Since such changes 
are localized in an otherwise static scene, many systems are able to model such 
effects. Lastly, the scene may be “globally dynamic”, where there might be 
unpredictable global changes, such as when a game player creates a building in 
Minecraft or Fortnite and expects to hear the audio rendering adapt to it in real 
time—while this has the most practical utility it is also the most challenging 
case. 

6. Robustness. VR requires high robustness given unpredictable user inputs. This 
means the severity and frequency of large outlying errors may matter more 
than average error. For instance, as the listener moves quickly through a scene 
through multiple rooms, the variation in reverberation and diffracted occlusion
must stay smooth reliably. This is a tightly restrictive constraint: a technique 
that has large outlying errors may not be viable in immersive VR regardless of 
its average error. As an example, an implausible error in calculating occlusion 
with only 0.1% probability for an experience running at 30 frames per second 
means distracting the user every 33 s on average. This deteriorates to 3.3 s with 
10 sound sources and so on. 


7. Scalability. The system should ideally expose compute-quality trade-offs along
two axes. Firstly, VR scenes can contain hundreds to thousands of dynamic sound
sources, and it is desirable if the signal processing can scale from high-quality
rendering of a few sound sources to lower quality (but still plausible) rendering
for numerous sound sources. Secondly, the acoustical simulation should
also allow methods for reducing quality gracefully as scene size increases, for
instance, from high-quality propagation modeling of a conference room to a rough
simulation of a city.


8. Automation. For VR applications, it is preferable to avoid any per-scene manual
work, such as geometric scene simplification. Game scenes in particular can
span over kilometers with multiple buildings designed iteratively during the production
process. This makes manual simplification a major hurdle for practical
usage. The auralization system must ideally directly ingest complex scenes with
millions of polygons, and perform any necessary simplification while minimizing
any human expertise or input, unlike in room auralization.

9. Artistic direction. VR often requires the final rendering to be controlled by
a sound designer. For instance, the reverberation and diffracted occlusion on 
important dialogue might be reduced to boost speech intelligibility in a game. 
Or one might want to re-map the dynamic range of the audio rendering with 
the limits of the audio reproduction system or user comfort in mind. A viable 
system must provide methods that allow such design intent to be expressed and 
influence the auralization process appropriately. 
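
The arithmetic behind the robustness example in the list above (an error probability of 0.1% per frame at 30 frames per second) can be written out as follows. The sketch assumes that errors occur independently per frame and per source, which is an idealization, and the function name is hypothetical.

```python
def mean_seconds_between_errors(p_error_per_frame, frames_per_second, n_sources=1):
    """Average time between audible errors, assuming independent errors
    per source and per frame (an idealization)."""
    errors_per_second = p_error_per_frame * frames_per_second * n_sources
    return 1.0 / errors_per_second

one_source = mean_seconds_between_errors(0.001, 30)        # about 33 s
ten_sources = mean_seconds_between_errors(0.001, 30, 10)   # about 3.3 s
```

The expected number of errors per second grows linearly with both the frame rate and the number of sources, so the average interval between distractions shrinks in inverse proportion.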


3.6 Rendering the BIR: the Deterministic-Statistical
Decomposition


A powerful technique employed by most real-time auralization systems is to decompose
the BIR as a sum of a deterministic and a statistical component. This is deeply
informed by acoustical perception (Sect. 3.4) and is key to enabling the computational
trade-offs VR auralization must contend with, as described in the prior section. The
initial sound and strong early reflections, such as sound heard via a portal or echoes
heard from nearby large surfaces, are treated deterministically: that is, simulated and
rendered in physical detail, and updated in real time based on the dynamic source and
listener pose and scene geometry. Weak early reflections and late reverberation are
represented only statistically, ignoring the precise details of each of the amplitudes
and delays of thousands of arrivals or more, which are perceived in aggregate.
To formalize, the BIR is decomposed as


\[ D(t, s, s'; x, x') = D_d(t, s, s'; x, x') + D_s(t, s, s'; x, x'). \tag{3.8} \]


Referring to Fig. 3.3, the initial sound and early reflection spikes deemed perceptually
salient can be included accurately in $D_d$. The residual is $D_s$, which is usually modeled
as noise characterized by its perceptually relevant statistical properties.
Substituting into the rendering equation (3.5) and observing linearity, we have


\[ q^{\{L,R\}}(t; x, x') = \sum_{\{d,s\}} q'(t) * \iint D_{\{d,s\}}(t, s, s'; x, x') * S\!\left(R'^{-1}(s'), t\right) * H^{\{L,R\}}\!\left(R^{-1}(s), t\right) ds\, ds', \tag{3.9} \]


so that the input mono signal, q'(t), is split off as input into separate filtering pro- 
cesses for the two components, whose binaural outputs are summed. This is a fairly 
standard architecture followed by both research and commercial systems, as the two 
components may be approximated independently with perception and the particu- 
lar application in mind. For the remainder of this section, we will assume the BIR 
components have been computed and focus on the signal processing for rendering. 
The next section will discuss how this decomposition informs the design of acoustic 
simulation methods. 
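
The split in (3.8) can be illustrated on a sampled monoaural response. In the sketch below, the deterministic part keeps the strongest spikes and the statistical part is whatever remains. Ranking arrivals by absolute amplitude is a stand-in assumption for "perceptually salient"; practical systems may use other criteria, and the response itself is synthetic.

```python
import numpy as np

def split_deterministic_statistical(h, n_d):
    """Split a sampled response into its n_d strongest spikes and the residual."""
    strongest = np.argsort(np.abs(h))[len(h) - n_d:]   # indices of the n_d largest magnitudes
    h_d = np.zeros_like(h)
    h_d[strongest] = h[strongest]
    h_s = h - h_d                                      # residual, modeled statistically
    return h_d, h_s

rng = np.random.default_rng(1)
h = 0.01 * rng.standard_normal(512)       # weak, dense arrivals
h[[10, 40, 75]] = [1.0, 0.6, -0.5]        # salient initial sound and reflections
h_d, h_s = split_deterministic_statistical(h, n_d=3)
```

By construction the two parts sum back to the original response, so by linearity they can be rendered through separate filter chains and their outputs added.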


3.6.1 Deterministic Component, $D_d$


The deterministic component, $D_d$, is typically represented as a set of $n_d$ peaks:


\[ D_d(t, s, s'; x, x') \approx \sum_{i=0}^{n_d - 1} a_i(t) * \delta(t - \tau_i)\, \delta(s' - s'_i)\, \delta(s - s_i). \tag{3.10} \]


Each term represents an echo of the emitted impulse that arrives at the listener position
after a delay of $\tau_i$ from world direction $s_i$, having been previously radiated from
the source in world direction $s'_i$ at time t = 0. The amplitude filter $a_i(t)$ captures
transport effects along the path from edge diffraction, scattering, and frequency-dependent
transmission/reflection from scene geometry. Note that the amplitude
filter is causal, i.e., $a_i(t) = 0$ for t < 0, and by convention $\tau_{i+1} > \tau_i$. The parameter
$n_d$ is key for trading between rendering quality and computational resources. It is
usual to at least treat the initial sound path deterministically (i.e., $n_d \geq 1$) because of
its high importance for localization due to the precedence effect. Audio engines will
usually designate this (i = 0) as the “dry” path with separate design controls due to
its perceptual importance.
Substituting from (3.10) into Eq. (3.9), we get


\[ q_d^{\{L,R\}}(t; x, x') = \sum_{i=0}^{n_d - 1} q'(t) * \delta(t - \tau_i) * a_i(t) * S\!\left(R'^{-1}(s'_i), t\right) * H^{\{L,R\}}\!\left(R^{-1}(s_i), t\right). \tag{3.11} \]


Thus, each path’s processing is a linear filter chain whose binaural output is summed 
to render the deterministic component to the listener. Reading the equation from left 
to right: for each path, take the monophonic source signal and input it to a delay line. 
Read the delay line at (fractional) delay $\tau_i$ and filter the output based on amplitude
filter $a_i$, then filter it based on the source’s radiation pattern. The lookup via $R'^{-1}(s'_i)$
signifies that one must rotate the radiant direction of the path from world space to 
the local coordinate system of the source’s spherical radiation pattern data. 

Finally, the last factor makes concrete the modularity shown in Fig. 3.1: the result- 
ing monophonic signal from this prior processing is sent to the spatializer module 
as arriving from direction $R^{-1}(s_i)$ relative to the listener. One is free to substitute
any spatializer to separately trade off quality and speed of spatialization versus other 
costs and priorities for the system. One could even use multiple spatialization tech- 
niques, such as high-quality spatialization for the initial path, and lower fidelity for 
reflections. In a software implementation, the spatializer often acts as a sink for 
monophonic signals, processing each, mixing their outputs, and sending them to a 
low-level audio engine for transmission to transducers, thus performing the summa- 
tion in (3.11) as well. 
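
The per-path chain of (3.11) can be sketched with broadband (scalar) filters. Everything specific in the example is an assumption made for illustration: delays are rounded to whole samples, the amplitude filter and source directivity are single gains, and the spatializer is a hypothetical constant-power panner standing in for HRTF filtering. The head frame is taken to have +x pointing forward and +y pointing to the listener’s left.

```python
import numpy as np

def yaw_matrix(angle_rad):
    """Rotation about the vertical (z) axis, mapping local to world coordinates."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def toy_spatializer(direction_head):
    """Hypothetical stand-in for an HRTF: constant-power (left, right) gains."""
    lateral = np.clip(direction_head[1], -1.0, 1.0)   # +1 means fully left
    angle = (1.0 - lateral) * np.pi / 4.0             # 0 (left) to pi/2 (right)
    return np.array([np.cos(angle), np.sin(angle)])

def render_paths(q_src, paths, R_listener, R_source, directivity, fs):
    """Sum the per-path chains of (3.11) using scalar gains and integer delays."""
    max_delay = max(int(round(path["delay_s"] * fs)) for path in paths)
    out = np.zeros((2, len(q_src) + max_delay))
    for path in paths:
        n = int(round(path["delay_s"] * fs))
        # Amplitude a_i times source directivity looked up in the source frame.
        gain = path["amplitude"] * directivity(R_source.T @ path["s_src"])
        # Arrival direction rotated from world into the head frame.
        left_right = toy_spatializer(R_listener.T @ path["s_arr"])
        out[:, n:n + len(q_src)] += gain * np.outer(left_right, q_src)
    return out

# Listener turned 90 degrees to face world +y; one path arrives from world +x.
paths = [{"delay_s": 0.001, "amplitude": 0.5,
          "s_src": np.array([-1.0, 0.0, 0.0]), "s_arr": np.array([1.0, 0.0, 0.0])}]
q_src = np.array([1.0, 2.0])
out = render_paths(q_src, paths, yaw_matrix(np.pi / 2), np.eye(3),
                   directivity=lambda direction: 1.0, fs=1000)
```

With the listener facing world +y, an arrival from world +x lies to the listener’s right, so the delayed, attenuated signal appears in the right channel only. The inverse rotations are implemented as transposes, which is valid for rotation matrices.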

Similar to the choice of spatializer, the details of all other filtering operations 
are highly flexible. For the amplitude filter $a_i$, the simplest realization is to multiply
by a scalar for average magnitude over frequencies, thus representing arrivals
with idealized Dirac spikes. But for the initial sound filter $a_0$, even in a minimalistic
setting it is common to apply a low-pass filter to capture the audible muffling
of visually occluded sounds. A more accurate implementation accounting for
frequency-dependent boundary impedance could use equalization filters in octave 
bands. For source directivity, it is common to measure and store radiation patterns 
as third-octave or octave-band data tabulated over the sphere of directions while 
ignoring phase. Convolution can then be realized via modern fast graphic equalizer 
algorithms that employ recursive time-domain filters [68]. 

The commutative and associative properties of convolution are a powerful tool 
to optimize signal processing. The ordering of filters in (3.11) has been chosen to 
illustrate this. The delay is applied in the very first operation, so that
we only need one single-write-multiple-read delay line shared across all paths. The
signal q'(t) is written as input, and each path reads out at delay $\tau_i$. This is a commonly
used optimization. Further, one may then use the associative property to group the
factors: $a_i(t) * S(R'^{-1}(s'_i), t)$. If both are implemented, say, using an octave-band
graphic equalizer, then the per-band amplitudes can be multiplied first and provided 
to a single instance of the equalizer—a nearly two-fold reduction in equalization 
compute. These optimizations illustrate the importance of linearity and modularity 
in the efficient implementation of auralization systems. 
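
Both optimizations can be sketched in a few lines. The equalizer below is a toy that scales frequency bands in the FFT domain; it is a hypothetical stand-in for the recursive graphic equalizers cited above, used only to show that two band-gain stages collapse into one. The band edges, gains, and tap delays are made-up values.

```python
import numpy as np

def apply_band_gains(x, gains, band_edges_hz, fs):
    """Toy graphic equalizer: scale each frequency band by a real gain (FFT domain).

    `band_edges_hz` holds the boundaries between bands, so there is one more
    gain than there are edges.
    """
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = np.searchsorted(band_edges_hz, freqs, side="right")
    return np.fft.irfft(spectrum * np.asarray(gains)[band], len(x))

fs = 16_000
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
edges = [250.0, 1000.0, 4000.0]            # three edges, four bands
a_i = np.array([1.0, 0.8, 0.5, 0.2])       # path amplitude filter, per band
s_i = np.array([0.9, 0.9, 0.6, 0.3])       # source directivity for this path, per band

two_stage = apply_band_gains(apply_band_gains(x, a_i, edges, fs), s_i, edges, fs)
one_stage = apply_band_gains(x, a_i * s_i, edges, fs)   # gains multiplied first

# Single-write-multiple-read delay line: written once, read at each path's delay.
max_delay = 64
delay_line = np.concatenate([np.zeros(max_delay), x])
taps = [5, 17, 40]                         # per-path delays in samples
reads = [delay_line[max_delay - d: max_delay - d + len(x)] for d in taps]
```

The one-stage and two-stage outputs match, which is why the per-band amplitudes can be multiplied before a single equalizer pass. Each entry of `reads` is the input delayed by the corresponding tap, obtained without copying the signal once per path.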


3.6.2 Statistical Component, $D_s$


The central concept for rendering the statistical component, $D_s$, is to use an
analysis-synthesis approach [56]. The analysis phase does lossy perceptual coding of the
statistical component of the BIR, $D_s$, to compute $\bar{D}_s$ as the energy envelope of the
response summing over time, frequency, and direction. We use the over-bar notation
$\bar{f}(\bar{y})$ to indicate that $y$ is sub-sampled, and $f$'s corresponding energy is appropriately
summed at each sample of $\bar{y}$ without loss via some windowing. For instance, if $p(t)$
is an impulse response, $\bar{p}(\bar{t})$ indicates the corresponding echogram, which is the
histogram of $p^2(t)$ sampled at some time-bin centers, $\bar{t}$. This notation is introduced to
indicate the reduction in the sampling rate of $y$, and loss of fine structure information
in $f$ at its original sampling rate, such as phase.
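
To make the echogram concrete, here is a minimal Python sketch that sums the squared impulse response into time bins. It assumes non-overlapping rectangular windows, which is only one possible choice of windowing, and the function name `echogram` and the 10 ms default bin width are illustrative assumptions.

```python
import numpy as np

def echogram(p, fs, bin_width_s=0.010):
    """Sum the energy p(t)^2 of an impulse response into time bins.

    Returns (bin centers in seconds, energy per bin). Rectangular,
    non-overlapping bins are assumed so that total energy is preserved.
    """
    p = np.asarray(p, dtype=float)
    n_bin = max(1, int(round(bin_width_s * fs)))   # samples per bin
    n_bins = int(np.ceil(len(p) / n_bin))
    energy = np.zeros(n_bins * n_bin)              # zero-pad the last bin
    energy[:len(p)] = p ** 2
    per_bin = energy.reshape(n_bins, n_bin).sum(axis=1)
    centers = (np.arange(n_bins) + 0.5) * n_bin / fs
    return centers, per_bin
```

The output is sampled far more coarsely than the input and retains no phase, but its sum equals the total energy of the response.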


Parametric reverberation. During real-time rendering, the description captured in
$\bar{D}_s$ can be synthesized using fast parametric reverberation techniques, the “parameters”
being statistical properties that determine $\bar{D}_s$, as we will discuss. The key
advantage is that since the fine structure of the response in time, frequency, and
direction is left unspecified, one has vast freedom in choosing efficient techniques.
These techniques often rely on recursive time-domain filtering, which can potentially
make the CPU cost far smaller than applying a filter that is a few seconds long via
frequency-domain convolution. The research problem is to make the artificial reverberation
sound natural. Among other concerns, the produced reverberation must have realistically
high temporal echo density and sound colorless, not introducing perceivable
spectral or temporal modulations that cannot be controlled. For further reading, we
point readers to the extensive survey in [99]. In the following, we focus on how one
might characterize $\bar{D}_s$.


Energy Decay Relief (EDR). The EDR [56] is a central concept for statistical encoding
of acoustical responses. Consider a monaural impulse response, $p(t)$. The EDR,
$\bar{p}(\bar{t}, \bar{\omega})$, is computed by performing short-time Fourier analysis on $p(t)$ to compute
how its energy spectral density, integrated over perceptual frequency bands with centers
$\bar{\omega}$, varies over time-bin centers $\bar{t}$. It can be visualized as a spectrogram. Frequency
dependence results from materials of the boundary (e.g., wood tends to be more 
absorbent at high frequencies compared to concrete) and atmospheric absorption. 
Frequency band centers are typically spaced by octaves for real-time auralization, 
and time bins typically have a width of around 10 ms. 
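
The spectrogram-like computation described above might be sketched as follows. This is a simplified illustration, not the procedure of [56]: it assumes non-overlapping rectangular frames, octave bands with edges at $f_c/\sqrt{2}$ and $f_c\sqrt{2}$, and a hypothetical function name `energy_decay_relief`. Note that 10 ms frames give only 100 Hz frequency resolution, so the lowest bands are coarsely resolved.

```python
import numpy as np

def energy_decay_relief(p, fs, bin_width_s=0.010,
                        band_centers_hz=(125, 250, 500, 1000, 2000, 4000)):
    """Band-integrated short-time energy of an impulse response.

    Returns (time-bin centers, array of shape (n_frames, n_bands)).
    """
    p = np.asarray(p, dtype=float)
    n = max(1, int(round(bin_width_s * fs)))       # samples per time bin
    n_frames = len(p) // n                          # trailing partial frame is dropped
    frames = p[:n_frames * n].reshape(n_frames, n)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2 # energy spectral density per frame
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    edr = np.zeros((n_frames, len(band_centers_hz)))
    for b, fc in enumerate(band_centers_hz):
        in_band = (freqs >= fc / np.sqrt(2)) & (freqs < fc * np.sqrt(2))
        edr[:, b] = spec[:, in_band].sum(axis=1)    # integrate over the octave band
    t_centers = (np.arange(n_frames) + 0.5) * n / fs
    return t_centers, edr
```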

The reduced sampling rate makes the EDR, $\bar{p}$, already quite compact compared
to $p$, which is a highly oscillatory noisy signal at audio sample rates. Further, the
EDR is smooth in time: it exhibits slow variation during early reflections (especially
if the strong peaks have been separated out already into $D_d$) followed by monotonic
decay during late reverberation. This opens up many avenues for a low-dimensional
description with a few parameters. For instance, for a single enclosure, the EDR in
each frequency band may be well-approximated by an exponential decay, resulting in
a compact description for the late reverberation parameterized by the initial energy,
$p_0$, and 60-dB decay time, $T_{60}$, in each frequency band:

$$\bar{p}(\bar{t}, \bar{\omega}) \approx p_0(\bar{\omega}) \, 10^{-6 \bar{t} / T_{60}(\bar{\omega})}. \tag{3.12}$$

Apart from substantial further compression, the great advantage of such a parametric
description is that it is easy to interpret, allowing artistic direction. Reverberation
plugins will typically provide $p_0$ as a combination of a broadband “wet gain” and
a graphic equalizer, as well as the decay times, $T_{60}(\bar{\omega})$, over frequency bands. For
interactive auralization, the artist can exert aesthetic control by the simple means
of modifying the reverberation parameters produced from acoustic simulation. For
instance, when the player enters a narrow tunnel in VR, footsteps might get a realistic
initial power ($p_0$) to convey the constricted space, yet speech might have the wet gain
reduced to increase the clarity ($C_{50}$) and improve the intelligibility of dialogue.
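
One simple way to obtain the two parameters for a band, assuming the energy envelope follows $p_0 \, 10^{-6t/T_{60}}$, is a straight-line fit to its logarithm. The sketch below is only an illustration under that assumption; the function name is hypothetical, and in practice the fit would be restricted to the late, noise-free part of the envelope.

```python
import numpy as np

def fit_exponential_decay(t, energy):
    """Least-squares fit of energy(t) ~ p0 * 10**(-6 * t / T60).

    log10(energy) is linear in t with intercept log10(p0) and slope -6 / T60.
    All energy values must be strictly positive.
    """
    t = np.asarray(t, dtype=float)
    log_e = np.log10(np.asarray(energy, dtype=float))
    slope, intercept = np.polyfit(t, log_e, 1)
    p0 = 10.0 ** intercept
    t60 = -6.0 / slope
    return p0, t60
```

Scaling the fitted `p0` or stretching `t60` per band is then the kind of edit an artist could apply before the parameters reach the reverberator.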


Bidirectional EDR. For an enclosure where conditions approach ideal diffuse rever- 
beration, the EDR can be a sufficient description. Parametric reverberators will typ- 
ically ensure that the same EDR is realized at both the ears but that the fine structure 
is mutually decorrelated, so that the reverberation is perceived by the listener as 
outside their head. However, in VR applications it becomes important to model the 
directionality inherent in reverberation because it can become strongly anisotropic. 
For instance, a visually occluded sound in another room heard through a door will 
be temporally diffuse, but directionally localized towards the door. 

The concept of EDR can be extended naturally to the bidirectional EDR,
$\bar{D}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x')$, which adds dependence on direction for both source and
listener. It can be constructed and interpreted as follows. Consider a source located at $x'$
that radiates a Dirac impulse in a beam centered around directional bin center $\bar{s}'$. After
propagating through the scene, it is received by the listener at location $x$, who beamforms
in the direction $\bar{s}$ and then computes the EDR on the received time-dependent
signal. The bidirectional EDR thus captures the frequency-dependent energy decay
for all direction-bin pairs $\{\bar{s}, \bar{s}'\}$.

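Invoking the exponential decay model, the bidirectional EDR may be approximated as

$$\bar{D}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x') \approx p_0(\bar{\omega}, \bar{s}, \bar{s}'; x, x') \, 10^{-6 \bar{t} / T_{60}(\bar{\omega}, \bar{s}, \bar{s}'; x, x')}. \tag{3.13}$$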


Due to the curse of dimensionality, simulating and rendering the bidirectional EDR 
can get quite costly despite the simplifications. In practice, one must choose the sam- 
pling resolution of all the parameters judiciously depending on the application. An 
extreme case of this is when we sum over the entire range of a parameter, effectively 
removing it as a dimension. 

Let’s consider one example that illustrates the kind of trade-offs offered by statistical
modeling in balancing rendering quality and computational complexity. One
may profitably compute the $T_{60}$ for energy summed over all listener directions $\bar{s}$ and
source directions $\bar{s}'$, which amounts to computing the monophonic EDR to derive
the reverberation time. In that case, one obtains a simplified hybrid approximation:

$$\bar{D}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x') \approx p_0(\bar{\omega}, \bar{s}, \bar{s}'; x, x') \, 10^{-6 \bar{t} / T_{60}(\bar{\omega}; x, x')}. \tag{3.14}$$

The first factor still captures strong anisotropy in reverberant energy, such as reverberation
heard by a listener as streaming from a portal, or reverberant power being higher
when a human speaker faces a nearby reverberant chamber rather than away. In fact,
$p_0(\bar{\omega}, \bar{s}, \bar{s}'; x, x')$ can be understood as a multiple-input-multiple-output (MIMO)
frequency-dependent transfer matrix for incoherent energy between a source and
receiver for directional channels sampled via $\bar{s}'$ and $\bar{s}$, respectively. The approximation
lies in the second factor: directionally varying decay times for a single sound
source are not modeled, which may be quite subtle to perceive in many cases.


3.7 Computing the BIR 


Acoustic simulation is the key computationally expensive task in modern auralization 
systems due to the high complexity of today’s virtual scenes. In particular, at every 
visual frame, for all source and listener pairs with locations $(x, x')$, the system must
compute the BIR $D(t, s, s'; x, x')$, which may then be applied to each source’s audio
as discussed in the prior section. There are two distinct ways the problem may be
approached: geometric and wave-based methods. In this section, we will discuss the 
fundamental ideas behind these techniques. 


3.7.1 Geometric Acoustics (GA) 


Geometric methods approximate sound propagation via the zero-wavelength (infi- 
nite frequency) asymptotic limit of the wave equation (3.1). Borrowing terminology 
from fluid mechanics, this yields a Lagrangian approach, where packets of energy 
are tracked explicitly through the scene as they travel along rays and repeatedly 
scatter into multiple packets in all directions each time they hit the scene boundary. 
The key strength of geometric methods is speed and flexibility: compared to a full- 
bandwidth wave simulation, tracing rays can be much cheaper, and it is much easier 
to incorporate physical phenomena and construct the BIR, assembled by explicitly 
constructing paths connecting source to listener. Today, these methods are standard 
in the area of room auralization. 

Their key challenges fall into two categories. Firstly, one must efficiently search
for paths connecting source to listener via complex scenes. Searching costs computation,
but doing too little can under-sample the response, causing audible jumps in the
rendering. Secondly, diffraction at audible wavelengths must be considered explicitly 
(since it is not present by default) to ensure plausibility. Both must be incorporated 
while balancing smooth rendering for moving sources and listener against the CPU 
cost of geometric analysis inherent in path search. 

Below, we briefly elaborate on the general design of GA systems and practical 
implications for VR auralization, and refer the reader to Savioja and Svensson’s 
excellent survey on the recent developments in GA techniques [87]. 

Simplified geometry. Due to the zero-wavelength approximation, geometric methods
remain sensitive to geometric detail at arbitrarily small scales, far below audible wavelengths. For
instance, if one directly used a visual mesh for GA simulation, a coffee mug can 
create a strong audible echo if the source and listener are connected by a specu- 
lar reflection path hitting the cup. Such specular glints are observed for light, but 
not sound with its much longer wavelength. So, it becomes important to build an 
equivalent simplified acoustical model of the scene which captures only large facets, 
combined with coefficients that summarize scattering due to diffraction. For instance, 
the seating area in a concert hall might be replaced with an enclosing box with an 
equivalent scattering coefficient. This process requires the user to have a degree of 
acoustical expertise, and inaccuracies can result without carefully specified geom- 
etry and boundary data [21]. However, for VR auralization, automation is highly 
desirable, with some recent work along these lines [88]. 


Deterministic-statistical decomposition. Geometric methods directly incorporate 
the deterministic-statistical decomposition in the simulation process to reduce CPU 
burden. In particular, the two components $D_d$ and $D_s$ are typically computed and
rendered separately and then mixed in the final rendering to balance quality and 
speed. 

GA methods perform a deterministic path search only up to a certain number of 
bounces on the scene boundary, called the reflection order. This is a key parameter 
for GA systems because it has a sensitive impact on both performance and render- 
ing quality, varying by system and application. Typically, the user can specify this 
parameter, which then implicitly determines the number of deterministic peaks rendered,
$n_d$, in (3.10). To accelerate path search, early methods [86] proposed using
the image source method [7], which is well-suited for single enclosures but scales 
exponentially with reflection order and does not account for edge diffraction. 
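
The following Python sketch illustrates the image source idea for first-order reflections and the growth in candidate image sources with reflection order. It is a simplified illustration: walls are assumed to be infinite planes given as a point and a normal, the visibility and validity checks needed for real rooms are omitted, and all function names are hypothetical.

```python
import numpy as np

def mirror_across_plane(point, plane_point, plane_normal):
    """Mirror a point across a plane given by a point on it and its normal."""
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    p = np.asarray(point, dtype=float)
    q = np.asarray(plane_point, dtype=float)
    return p - 2.0 * np.dot(p - q, n) * n

def first_order_delays(src, lis, walls, c=343.0):
    """Arrival delay (s) of one specular reflection per wall.

    The reflected path length equals the straight-line distance from the
    image source to the listener. No visibility checks are performed.
    """
    lis = np.asarray(lis, dtype=float)
    delays = []
    for plane_point, normal in walls:
        image = mirror_across_plane(src, plane_point, normal)
        delays.append(float(np.linalg.norm(image - lis) / c))
    return delays

def max_image_sources(n_walls, order):
    """Upper bound on candidate image sources up to a reflection order.

    Each image can be mirrored across every wall except the one that
    produced it, giving n_walls * (n_walls - 1)**(k - 1) candidates at order k.
    """
    return sum(n_walls * (n_walls - 1) ** (k - 1) for k in range(1, order + 1))
```

For a six-walled room, `max_image_sources(6, 3)` gives 186 candidates, and each further order multiplies the newest generation by five, which is the exponential scaling noted above.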

Subsequent work on beam tracing [36] showed that in multi-room scenes, precomputing
a beam-tree data structure can at once control the exponential scaling
and also incorporate edge diffraction which is crucial for plausibility in such densely 
occluded scenes. The system introduced precomputation as a powerful technique 
for reducing runtime acoustics computation, which most modern systems employ at 
least to some degree. 

A key general concept employed in the beam tracing work in [36] is the room- 
portal decomposition: an indoor scene with many rooms is approximately decomposed
into a set of simplicial convex shapes that represent room volume, connected
by flat portals representing doors. This is a frequently used method in GA systems, as 
it allows efficient deterministic path search on the discrete graph formed by rooms as 
nodes and portals as connecting edges. However, room-portal decomposition does 
not generalize to outdoor or mixed scenes, which is a key limitation that recent 
research is focusing on to allow fast deterministic search of high-order diffraction 
paths [34, 88]. 
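
A deterministic search on such a room-portal graph can be sketched with a standard shortest-path algorithm. The example below uses Dijkstra's algorithm as an illustrative stand-in, not the search used in any particular system; the per-portal `length` weight is a hypothetical simplification of the acoustic path length between rooms.

```python
import heapq

def shortest_portal_path(portals, start_room, goal_room):
    """Dijkstra search over rooms (nodes) connected by portals (edges).

    portals: iterable of (room_a, room_b, length) tuples, treated as two-way.
    Returns (total length, list of rooms), or (inf, []) if unreachable.
    """
    graph = {}
    for a, b, w in portals:
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    best = {start_room: 0.0}
    prev = {}
    heap = [(0.0, start_room)]
    while heap:
        dist, room = heapq.heappop(heap)
        if room == goal_room:
            path = [room]
            while room in prev:            # walk back to the start room
                room = prev[room]
                path.append(room)
            return dist, path[::-1]
        if dist > best.get(room, float("inf")):
            continue                       # stale queue entry
        for nxt, w in graph.get(room, []):
            cand = dist + w
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = room
                heapq.heappush(heap, (cand, nxt))
    return float("inf"), []
```

Because the graph has only as many nodes as rooms, the search cost is independent of the geometric detail inside each room.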

Techniques developed for light transport in the computer graphics community are 
a great fit for computing the statistical component owing to its phase incoherence. 
Many methods are possible, such as those based on radiosity [8, 93]. Stochastic 
path tracing is a standard method in both graphics and acoustics communities today,
used originally by DIVA [86] and in modern systems like RAVEN [90]. More recent 
improvements use bidirectional path tracing [24], which directly exploits the bidi- 
rectional reciprocity principle (3.7) to accelerate computation. 

GA methods cannot construct the fine structure of the reverberant portion of the
response, but as we discussed in Sect. 3.6.2, it is often sufficient to build the bidirectional
energy decay relief, $\bar{D}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x')$, or some lower dimensional
approximation ignoring directionality. With path tracing techniques, this is directly
accomplished by accumulating into a histogram indexed on all the function parameters:
each path represents an energy packet that accumulates into its corresponding histogram
bin. The key parameter trading quality and cost is the number of paths sampled, which
must be large enough that the energy value in each histogram bin is sufficiently converged.
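
The histogram accumulation can be sketched as follows for the lower dimensional case that ignores directionality, so that bins are indexed by time and frequency band only. The function name and array layout are assumptions made for illustration, and each traced path is assumed to arrive with its per-band energy already computed.

```python
import numpy as np

def accumulate_paths(arrival_times, band_energies, n_time_bins, bin_width_s=0.010):
    """Accumulate traced energy packets into a time-by-band histogram.

    arrival_times: shape (n_paths,), seconds.
    band_energies: shape (n_paths, n_bands), energy carried by each path.
    Paths arriving outside the histogram's time range are discarded.
    """
    times = np.asarray(arrival_times, dtype=float)
    energies = np.asarray(band_energies, dtype=float)
    hist = np.zeros((n_time_bins, energies.shape[1]))
    idx = np.floor(times / bin_width_s).astype(int)
    keep = (idx >= 0) & (idx < n_time_bins)
    # np.add.at handles repeated indices, so several paths can share a bin.
    np.add.at(hist, idx[keep], energies[keep])
    return hist
```

Adding direction would simply extend the index with listener and source direction bins, at the price of needing more paths for every bin to converge.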

With simplified scenes admitting a room-portal decomposition one can expect 
robust convergence, or even use approximations that avoid path tracing altogether 
[94], but for path tracing in complex VR scenes, the required number of paths for a 
converged histogram can vary significantly based on source and listener locations, 
{x, x’}. For instance, if they are connected only through a few narrow apertures in the 
scene, it can be hard to find connecting paths despite extensive random searching. 
There is precedent for such issues in computer graphics as well [101], representing
a frontier for new research with systematic convergence studies, as initiated in [24]. 


3.7.2 Wave Acoustics (WA) 


Wave acoustic methods take an Eulerian approach: space time is discretized onto a 
fictitious background, such as a uniform discrete grid, and then one updates pressure 
amplitude in each cell at each time-step. Paths are not constructed explicitly, so as 
energy scatters in various directions from scene surfaces, the amount of information 
tracked does not change. Thus, arbitrary combinations of diffraction and scattering 
are naturally captured by wave methods. Running a volumetric simulation with a source
located at $x'$ for a sufficient duration directly produces a discrete approximation of
the Green’s function $p(t, x; x')$. The BIR
$D(t, s, s'; x, x')$ may then be computed via accurate plane-wave decomposition in
a volume centered at the source and listener location [2, 91] or via the much faster 
approximation using instantaneous flux density [26], first applied to audio coding 
in [74]. 
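
To illustrate the Eulerian update, the sketch below runs a basic second-order finite-difference time-domain scheme for the scalar wave equation. It is deliberately simplified and is not any of the cited solvers: the grid is 2D rather than 3D, the boundary is held at zero pressure only because that is the shortest thing to write, and the function name and default parameters are assumptions.

```python
import numpy as np

def fdtd_2d(n=64, steps=100, c=343.0, dx=0.1, src=(32, 32), rec=(32, 48)):
    """Leapfrog FDTD on a uniform 2D grid; returns (dt, pressure at receiver)."""
    dt = 0.9 * dx / (c * np.sqrt(2.0))   # 90% of the 2D stability (CFL) limit
    lam2 = (c * dt / dx) ** 2
    p_prev = np.zeros((n, n))
    p = np.zeros((n, n))
    p[src] = 1.0                         # impulsive initial pressure at the source cell
    response = np.zeros(steps)
    for k in range(steps):
        lap = np.zeros_like(p)
        # Five-point Laplacian on interior cells.
        lap[1:-1, 1:-1] = (p[2:, 1:-1] + p[:-2, 1:-1] +
                           p[1:-1, 2:] + p[1:-1, :-2] - 4.0 * p[1:-1, 1:-1])
        # Edge cells have zero Laplacian and zero history, so they stay at
        # zero pressure (a pressure-release boundary, chosen for brevity).
        p_next = 2.0 * p - p_prev + lam2 * lap
        p_prev, p = p, p_next
        response[k] = p[rec]
    return dt, response
```

The work per step is the same regardless of how many reflections or diffractions have occurred, since only the pressure field is stored and no paths are tracked.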


Numerical solvers. The main challenge of wave methods is their computational cost. 
Since wave solvers directly resolve the detailed wave field by discretizing space and 
time, their cost scales as the fourth power of the maximum simulated frequency and 
third power of the scene diameter, due to Nyquist criteria as outlined in Sect. 3.2.1. 
This made them outright infeasible for most practical uses until the last decade, 
apart from low-frequency modal simulations up to a few hundred Hertz. However, 
they have seen a resurgence of interest over the last decade, with many kinds of 
solvers being actively researched today for auralization, such as spectral methods [52,
77], finite difference methods [49, 85], and the finite element method [71, 103].
Alongside the progress in numerical methods, the increased computational power of 
CPUs and graphics processors, as well as the availability of increased RAM, now 
allows simulations of practical cases of interest, such as concert halls, up to mid- 
frequencies (1 kHz and beyond). This is still short of complete audible bandwidth, 
and it is common to use approximate extrapolation beyond the band-limit frequency. 
The compute times remain suitable only for off-line computation, typically taking a few
hours. The availability of commodity cloud computation has further aided the wider
applicability of wave methods despite the cost. 
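
The scaling quoted above, fourth power in frequency and third power in scene diameter, can be checked with a back-of-the-envelope estimate. The sketch below counts grid cells times time steps for a cubic domain; the function name and the default of two grid points per wavelength (the Nyquist minimum) are illustrative assumptions, and constant factors are ignored.

```python
def wave_sim_work(f_max_hz, diameter_m, duration_s=1.0, c=343.0,
                  points_per_wavelength=2.0):
    """Rough work estimate (cell updates) for a time-domain wave simulation."""
    dx = c / (points_per_wavelength * f_max_hz)  # grid spacing shrinks as 1/f
    cells = (diameter_m / dx) ** 3               # cubic domain: ~ (L * f)^3 cells
    dt = dx / (c * 3 ** 0.5)                     # 3D stability limit: dt ~ 1/f
    steps = duration_s / dt
    return cells * steps
```

Doubling the maximum frequency multiplies the estimate by 16 and doubling the scene diameter by 8, so extending a 1 kHz simulation to 20 kHz would cost roughly $20^4 = 160{,}000$ times more.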


Precomputation and static scenes. The idea of precomputation has been central 
to the increasing application of wave methods in VR auralization. Real-time aural- 
ization with wave methods was first shown to be viable for complex scenes in [80]. 
The method performs multiple simulations off-line and the resulting (monophonic) 
impulse responses are encoded and stored in a file. At runtime, this file is loaded, 
and the sampled acoustical data are spatially interpolated for a dynamic source and 
listener which informs spatialization of the source audio. This overall architecture is 
followed by most wave-based auralization methods. 
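
The runtime lookup step can be illustrated with plain bilinear interpolation of one precomputed parameter over a uniform 2D grid of sample locations. This is only a sketch under that assumption: practical systems may use non-uniform sampling and more careful interpolation, and the function name and arguments are hypothetical.

```python
import numpy as np

def lookup_bilinear(field, origin, spacing, pos):
    """Bilinearly interpolate a parameter sampled on a uniform 2D grid.

    field: array of shape (nx, ny) with nx, ny >= 2, one value per sample point.
    origin: world position of field[0, 0]; spacing: grid spacing in meters.
    pos: query position (x, y); positions outside the grid are clamped.
    """
    field = np.asarray(field, dtype=float)
    u = (np.asarray(pos, dtype=float) - np.asarray(origin, dtype=float)) / spacing
    upper = np.array(field.shape) - 1
    u = np.clip(u, 0.0, upper)
    i0 = np.minimum(np.floor(u).astype(int), upper - 1)  # lower-left cell corner
    fx, fy = u - i0                                       # fractional offsets in [0, 1]
    i, j = i0
    return ((1 - fx) * (1 - fy) * field[i, j] + fx * (1 - fy) * field[i + 1, j] +
            (1 - fx) * fy * field[i, j + 1] + fx * fy * field[i + 1, j + 1])
```

Because only a few stored values are read per query, the runtime cost is independent of how expensive the off-line simulation was.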

The disadvantage of precomputation is that it is limited to static scenes. However, 
it has the great benefit that the fidelity of acoustical simulation becomes decoupled 
from runtime CPU usage. One may perform a detailed simulation directly on complex 
scene geometry ensuring robust results at runtime. These trade-offs are highly analo- 
gous to “light baking” which is a common feature of game engines today: expensive 
global illumination is simulated beforehand on static scenes to ensure fast runtime 
rendering. Similar to developments in lighting, one can conceivably incorporate local 
dynamism such as additional occlusion from portals [76] or moving objects [84] in 
the future. 


Parametric encoding. The key research challenge introduced by precomputation is 
that the BIR field D(t, s, s’, x, x’) is 11-dimensional and highly oscillatory. Capturing 
it in detail can easily take an impractical amount of storage. Spatial audio coding 
methods such as DirAC [73, 74] demonstrate a path forward, in that they extract and 
render perceptual properties from directional audio recordings rather than trying to 
re-create the physical sound field. This in turn is similar in spirit to audio coding 
methods such as MP3 where precise waveform reconstruction is eschewed in favor 
of controllable trade-offs between perceived quality and compressed size. 

These observations have motivated a new thread of auralization research on wave- 
based parametric methods [26, 78, 79] that combine precomputed wave acoustics 
with compact, perceptual coding of the resulting BIR fields. Such methods are prac- 
tical enough today to be employed in many gaming applications. The deterministic- 
statistical decomposition plays a crucial role in this encoding stage, as we will elab- 
orate in Sect. 3.8.4 when we discuss [26] in more detail. 


Physical encoding. In a parallel thread, there has been work on methods that directly 
approximate and convolve the complete BIR without involving perceptual coding. 
The equivalent source method was proposed in [63, 64], at the expense of restricting
to scenes that are a sparse set of exterior-scattering building facades. More recent 
methods for high-quality building auralization have been developed, which sample 
and interpolate BIRs for dynamic rendering [55]. The advantage is that no inherent 
assumptions are made about the perception or the structure of the BIR, but in turn, 
such systems tend to be more expensive and current technology is limited to static 
sound sources. 


3.8 Auralization Systems 


In this section, we will discuss a few illustrative example systems in more detail. We 
emphasize that this should not be interpreted as a representative survey. Instead, our 
aim is to illustrate how the design of practical systems can vary widely depending on 
the intended application, chosen algorithms, and in particular how systems choose 
to prioritize a subset of the design constraints (Sect. 3.5). Most of these systems are 
available for download and experimentation. 


3.8.1 Room Acoustics for Virtual Environments (RAVEN) 


RAVEN [90] is a research system built from the ground up aiming for perceptually 
authentic and real-time auralization in VR. The computational budget is thus on the 
high side, such as all the resources of a single or a few networked computers. This is
in line with the intended application: for an acoustician evaluating a planned design, 
it is more important to hear a result with reliable predictive value, and the precise 
amount of computation does not matter as long as it is real time. RAVEN is a great 
example of the archetypal decisions involved in the end-to-end design of modern 
real-time geometric systems. 

A key assumption in the system is that the scene is a typical building floor. Many 
decisions and efficiencies flow naturally. Chiefly, one can employ the room-portal 
decomposition as discussed in Sect. 3.7.1. Local scene dynamism is also allowed by 
the system, such as opening or closing doors, with limited precomputation on the 
scene geometry. However, like most geometric acoustic systems, the scene geometry 
has to be manually simplified with acoustical expertise to achieve the simplified cells 
required by rooms and portals. Flexible signal processing that can include artistic 
design need not be considered, since the application is physical prediction. 

RAVEN models diffraction on both the deterministic and statistical components 
of the BIR. The former uses the image source method, with reflection orders up to 3 
for real-time evaluation. Edge sources are introduced to account for diffraction paths 
that, e.g., first undergo a bounce from a flat surface and then diffract around a portal 
edge. Capturing such effects is especially important for smooth results on dynamic 
source and listener motion, which RAVEN carefully models. 
The statistical component uses stochastic ray tracing with improved convergence
using the “diffuse rain” technique [90]. To model diffraction for reverberation, a 
probabilistic scheme is used [95] that deflects rays that pass close enough to scene 
edges. Since the precise reconstruction of the reverberant characteristics is of central 
importance in architectural acoustics, RAVEN models the complete bidirectional 
energy decay relief, as illustrated in [90, Fig. 5.19]. 


3.8.2 Wwise Spatial Audio 


Audiokinetic’s Wwise [9] is a commonly employed audio engine in video games, 
alongside many other audio design applications. Wwise provides both geometric 
acoustical simulation and HRTF spatialization using either object-based or spherical- 
harmonic processing (Sect. 3.2.4). The system stands in illustrative contrast to 
RAVEN, showing how different application needs can deeply shape technical choices 
of auralization systems. A detailed description of ideas and motivation can be found 
in the series of white papers [23]. 

Gaming applications require very low CPU utilization (fraction of a single core) 
without requiring physical accuracy. But one needs to approximate carefully. The 
rendering must stay perceptually believable, such as smooth acoustic changes on 
fast source motion or visual occlusion. Minimizing precomputation is desirable for 
reducing artist iteration times. Finally, the ability of artists to interpret the acoustic 
simulation and design the rendered output is paramount. 

To meet these goals, Wwise also starts with a deterministic-statistical decom- 
position. Like most geometric systems, the user must provide a simplified audio 
geometry for the scene, which is the bulk of the work. Once this is done, the system 
responds interactively without precomputation. The initial sound is derived based 
on an explicit path search on simplified geometry at runtime, with reflections mod- 
eled via image sources up to some user-controlled reflection order (usually ~3 for 
efficiency). 

Importantly, rather than estimating diffraction losses based on physical approxi- 
mations such as the Uniform Theory of Diffraction [59] that cost CPU, the system 
exposes an abstract “diffraction coefficient” that varies smoothly as the sound source
and corresponding image sources transition between visual occlusion and visibility.
This ameliorates the key perceptual deficit of audible loudness jumps that result when 
diffraction is ignored. The audio designer can draw a function in the user interface 
to map the diffraction coefficient to loudness attenuation. This design underlines 
how practical systems balance CPU cost, plausible rendering, and artistic control. 
Note how just reducing accuracy to gain CPU is not the path taken: instead, one 
must carefully understand which physical behaviors must be preserved to not violate 
our (stringent) sensory expectations, such as that sound fields rarely show a sudden 
audible variation on small movement in everyday life. 
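
The idea of a designer-drawn mapping can be sketched generically as a piecewise-linear curve from an abstract coefficient in [0, 1] to attenuation in decibels. This is not the Wwise API; the function names and the example curve points are hypothetical, chosen only to show that a continuous curve yields continuous loudness.

```python
import numpy as np

def diffraction_attenuation_db(coeff, curve_points=((0.0, 0.0), (0.5, -6.0), (1.0, -24.0))):
    """Map a diffraction coefficient to attenuation (dB) via a drawn curve.

    curve_points: (coefficient, attenuation_dB) pairs with increasing
    coefficients. Values outside the curve's range are clamped to its ends.
    """
    xs, ys = zip(*curve_points)
    return float(np.interp(coeff, xs, ys))

def db_to_gain(db):
    """Convert decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)
```

As the source moves from visible (coefficient 0) to deeply occluded (coefficient 1), the gain changes smoothly, avoiding the loudness jump that results from an abrupt visibility test.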

For modeling the statistical component, the system avoids costly stochastic ray 
tracing in favor of reverberation flow modeled on a room-portal decomposition of 
the simplified scene. The design is in the vein of [94], with diffuse energy flow on
a graph composed of rooms as nodes and portals as edges. However, in keeping 
with the primary goal of audio design, the user is free to choose or parametrically 
design individual filters for each room, while the system ensures that the net result 
correctly accumulates reverberation and spatializes it as streaming to the listener 
from (potentially) multiple portals. Again, plausibility, performance, and design are 
prioritized over adherence to accuracy, keeping in mind the primary use case of 
scalable rendering for games and VR. 


3.8.3 Steam Audio and Resonance Audio 


Steam Audio [100] and Resonance Audio [46] are geometric acoustics systems also 
designed for gaming and VR applications with similar considerations as Wwise Spa- 
tial Audio. They both offer HRTF spatialization combined with geometric acoustics 
modeling; however, diffraction is ignored. A distinctive aspect of Steam Audio is 
the capability to precompute room reverberation filters (i.e., the statistical compo- 
nent) directly from scene geometry without requiring any simplification, auralized 
dynamically based on listener location. Resonance Audio on the other hand primar- 
ily focuses on highly efficient spatialization [47] that scales down to mobile devices 
for numerous sources, using up to third-order spherical harmonics. In fact, Reso- 
nance Audio can be used as a plugin within the Wwise audio engine to perform 
spatialization, illustrating the utility of the modular design of auralization systems 
(Sect. 3.2). 


3.8.4 Project Acoustics (PA) 


We now consider a wave-based system, Project Acoustics [66], which has shown 
practical viability for gaming [81] and VR [45] experiences recently. We summarize 
its key design ideas here; technical details can be found in [26, 78, 79]. As is typical 
for wave acoustics systems (Sect. 3.7.2), costly simulation is performed in a pre- 
computation stage, shown on the left of Fig. 3.4. Many simulations are performed in 
parallel that collectively sample and compress the entire BIR field D(t, s, s’, x, x’) 
into an acoustic dataset. With today’s commodity cloud computing resources, com- 
plete game scenes may be processed in less than an hour. 

The bidirectional reciprocity principle (3.7) plays an important role. The listener 
location, x, is typically restricted in motion to head height above the ground, thus 
varying in two dimensions rather than three, such as the floors of a building. Potential 
listener locations are sampled in the lowered dimension adapting to local geome- 
try [25]. Note that source locations, x’, may still vary in three dimensions. Then, a 
series of 3D wave simulations are performed with each potential listener location
acting as source during simulation. The reduction in BIR field dimension by one
yields an order-of-magnitude reduction in data size.

106 N. Raghuvanshi and H. Gamper

Fig. 3.4 High-level architecture of Project Acoustics’ wave-based parametric auralization.
Preprocessing: wave simulation, parameter fields, acoustic dataset. Runtime: lookup at
(x, x’), perceptual parameters, aesthetic modification, spatialization.

Project Acoustics’ main idea is to employ lossy perceptual encoding on the
BIR field to bring it within practical storage budgets of a few hundred MB. The
deterministic-statistical decomposition is employed at this stage. The initial arrival
time and direction are encoded explicitly to ensure the correct localization of the
sound, and the rest of the response is encoded statistically (i.e., $n_d = 1$ referring
to Sect. 3.6.1). An example simulation snapshot is shown in Fig. 3.4 with the corresponding
initial path encoding visualized on the right. Color shows frequency-averaged
loudness, and arrows show the localized direction at the listener location,
x, with the source location x’ varying over the image. For instance, any source inside 
the room would be localized by the listener as arriving from the door, so the arrows 
inside the room consistently point in the door-to-listener direction. The perceptual 
parameters vary smoothly over space, mirroring our everyday experience, allowing 
further compression via entropy coding [78]. 

The statistical component simplifies (3.14) further to average over all simulated
frequencies, approximating the bidirectional energy decay relief as

$$\bar{D}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x') \approx p_0(\bar{s}, \bar{s}'; x, x') \, 10^{-6 \bar{t} / T_{60}(x, x')}.$$


The directions $\{\bar{s}, \bar{s}'\}$ sample the six signed Cartesian directions, thus discretizing
$p_0$ to a 6 × 6 “reflections transfer” matrix that compactly approximates directional
reverberation, alongside a single $T_{60}$ value across direction and frequency. Visualizations
of the reflections transfer matrix can be found in [26] that illustrate how
it captures anisotropic effects like directional reverberation from portals or nearby 
reverberant chambers. 
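
As a sketch of how such an encoding might be evaluated, the function below combines a 6 × 6 energy transfer matrix with a single decay time to give reverberant energy per listener direction over time. The function name, the `[listener direction, source direction]` index convention, and the uniform default source weighting are assumptions for illustration, not details taken from [26].

```python
import numpy as np

def directional_reverb_energy(p0, t60, t, source_weights=None):
    """Evaluate p0 * 10**(-6 t / T60) per listener direction.

    p0: (6, 6) matrix, assumed indexed [listener_dir, source_dir] over the
        six signed Cartesian directions (+x, -x, +y, -y, +z, -z).
    t60: single decay time in seconds; t: time or array of times in seconds.
    source_weights: energy radiated into each source direction
        (defaults to uniform radiation summing to one).
    Returns an array of shape (n_times, 6).
    """
    p0 = np.asarray(p0, dtype=float)
    if source_weights is None:
        w = np.full(6, 1.0 / 6.0)
    else:
        w = np.asarray(source_weights, dtype=float)
    initial = p0 @ w                                   # energy per listener direction at t = 0
    decay = 10.0 ** (-6.0 * np.atleast_1d(np.asarray(t, dtype=float)) / t60)
    return np.outer(decay, initial)
```

A matrix with most of its energy in one listener row would, for example, reproduce reverberation that appears to stream from a single portal.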

One can observe that this encoding is quite simplified and can be expected to 
only plausibly reproduce the simulated BIR field. The choices result from the system’s
goal: capturing key geometry-dependent audio cues within a compact storage
budget, since too large a size simply precludes practical use. For instance, one could encode
much more detailed information such as numerous ($n_d \sim 20$–$50$) individual reflection
peaks [80], but that is far too costly, in turn motivating recent research on how one
might trade between number of encoded peaks ($n_d$) and perceived authenticity [18].

Generally speaking, precomputed systems shift the trade-off from quality-versus-CPU,
as with runtime propagation simulation (Sects. 3.8.1 and 3.8.2), to quality-versus-storage.
This holds regardless of whether the precomputation is geometric (Steam
Audio) or wave-based (Project Acoustics). Precomputation can introduce limitations 
such as slower artist turnaround times and static scenes, but in return significantly 
lowers the barrier to viability whenever the available CPU is severely restricted, 
which is the case for gaming applications or untethered VR platforms. 

Wave simulation forces precomputation in today’s systems due to its high compu- 
tational cost, but its advantage compared to geometric methods is that complex visual 
scene geometry is processed directly, without requiring any manual simplification. 
Further, arbitrary order of diffraction around detailed geometry in general scenes 
(trees, buildings, chairs, etc.) is modeled, which avoids the risk of not sampling a 
salient path. In sum, one pays a high, fixed precomputation cost largely insensitive 
to scene complexity, and if that is feasible, obtains robust results directly from visual 
geometry with a low CPU cost. 

As discussed in Sect. 3.6.2, parametric approaches enable intuitive controls for 
sound designers, which is of crucial importance in gaming applications, as we also 
saw in the design of the Wwise Spatial Audio system. In the case of PA, the parameters 
are looked up at each source-listener location pair at runtime (right of Fig. 3.4), and 
it becomes possible for the artist to specify dynamic aesthetic modifications of the 
physically-based baseline produced by simulation [44]. The sounds and modified 
acoustic parameters can then be sent to any efficient parametric reverberation and 
spatialization sub-system for rendering the binaural output. 


3.9 Summary and Outlook 


Creating an immersive and interactive sonic experience for virtual reality applications 
requires auralizing complex 3D scenes robustly and within tight real-time constraints. 
To meet these requirements, real-time systems follow a modular approach of dividing 
the problem into sound production, propagation, and spatialization. These can be 
mathematically formulated via the source directivity function, bidirectional impulse 
responses (BIR), and head-related transfer functions (HRTFs), respectively, leading 
to a general framework. Human auditory perception of acoustic responses deeply 
informs most systems, motivating optimizations such as the deterministic-statistical 
decomposition of the BIR. 

We discussed many design considerations that inform the design of practical sys- 
tems. We illustrated with a few auralization systems how the application requirements 
shape design choices, ranging from perceptual authenticity in architectural acous- 
tics, to game engines where believability, audio design, and CPU usage take central 
priority. With more development, one can hope for auralization systems in the future 
that are capable of scaling their quality-compute trade-offs to span all applications
of VR auralization. Such a convergent evolution would be in line with current trends 
in visual rendering where off-line photo-realistic rendering techniques and real-time 
game techniques are becoming increasingly unified [33]. 

Looking to the future, real-time auralization faces two major research challenges: 
scalability and scene dynamics. Game and VR scenes are trending toward completely 
open worlds where entire cities are modeled at once, spanning tens of kilometers, with 
numerous sound sources, where very few assumptions can be made about the scene’s 
geometry or complexity. Similar considerations hold for engineering prediction of 
outdoor acoustics, such as noise levels in a city. We need real-time techniques that 
can scale to such challenging scenarios within CPU budgets, perhaps by analogy with 
level-of-detail techniques used in graphics. Scene dynamism is a related challenge. 
Many current game engines allow the users to make global changes to immersive 3D 
worlds in real time. Dynamic techniques are required that can model, for instance, 
the diffraction loss around a just-created wall within tolerable latency. Progress in 
this direction has only just begun [35, 75, 83, 84]. 

The open challenge for the future is to build real-time auralization systems that can 
gracefully scale from plausible to accurate audio rendering for complex, dynamic, 
city-scale scenes depending on available computational resources. There is much to 
be done, and many undiscovered, foundational ideas remain. 


Chapter 4
System-to-User and User-to-System
Adaptations in Binaural Audio


Lorenzo Picinali and Brian F. G. Katz 


Abstract This chapter concerns concepts of adaptation in a binaural audio context (i.e.
headphone-based three-dimensional audio rendering and associated spatial hearing 
aspects), considering first the adaptation of the rendering system to the acoustic 
and perceptual properties of the user, and second the adaptation of the user to the 
rendering quality of the system. We start with an overview of the basic mechanisms of 
human sound source localisation, introducing expressions such as localisation cues 
and interaural differences, and the concept of the Head-Related Transfer Function 
(HRTF), which is the basis of most 3D spatialisation systems in VR. The chapter then 
moves to more complex concepts and processes, such as HRTF selection (system- 
to-user adaptation) and HRTF accommodation (user-to-system adaptation). State- 
of-the-art HRTF modelling and selection methods are presented, looking at various 
approaches and at how these have been evaluated. Similarly, the process of HRTF 
accommodation is detailed, with a case study employed as an example. Finally, the 
potential of these two approaches is discussed, considering their combined use in
a practical context, as well as introducing a few open challenges for future research. 


4.1 Introduction 


Binaural technology is the solution for sound spatialisation which is the closest to 
real-life listening. It attempts to mimic the entirety of acoustic cues associated with the 
human localisation of sounds, reproducing the corresponding acoustic pressure signal 
at the entrance of the two ear canals of the listener (binaural literally means “related to 
two ears”). These two signals should be a complete and sufficient representation of the 
sound scene, since they are the only information that the auditory system requires in 
order to identify the 3D location of a sound source. Thus, binaural rendering of spatial
information is fundamentally based on the production (either through recording or 
synthesis) of localisation cues that are the consequence of the incident sound upon 
the listener’s torso, head, and ears on the way to the ear canal, and subsequently to 
the eardrums. These cues are, namely, the ITD (interaural time difference), the ILD 
(interaural level difference) and spectral cues [48, 68]. Their combined effects are 
represented by the Head-Related Transfer Function (HRTF), which characterises the 
spectro-temporal filtering of a locus of source positions around a given head.¹


4.1.1 Localisation Cues and Their Individual Nature 


The ILD and ITD as a function of source position are determined principally by 
the size and shape of the head, as well as the position of the ears on the two sides. 
In order to better understand these localisation cues, Fig. 4.1 shows how ITD and 
ILD vary as a function of both distance (1.5–10 m) and azimuth. This comparison
highlights potential effects of ITD/ILD mismatch, especially if they occur near the 
interaural axis where they can affect distance perception. The results were obtained 
by Boundary Element Method (BEM) simulation of the HRTF using the open-source 
mesh2hrtf software [110, 111]. The mesh employed was obtained from an MRI
scan of a Neumann dummy recording head (model KU-100), previously used in 
HRTF computation [32] and measurement [4] comparisons. These cues vary as a 
function of frequency. For this example, the ITD was calculated using the
Threshold lp −30 dB method (for a summary of various ITD estimation methods see [50]),
which detects the first onset using a −30 dB relative threshold on a 3 kHz low-pass
filtered version of the HRIR, as this has been shown to be the most perceptually relevant
method for ITD estimation among 32 different estimation methods and variants [7,
50]. The ILD was calculated as the difference of left and right HRIR RMS values,
after applying a 3 kHz high-pass filter. The use of low-pass and high-pass filters for
the two different acoustic cues is based on previous studies showing the frequency
dependence of the different auditory cues [101], with ITD being dominated by
low-frequency content (with interpretation of phase information being inconclusive for
wavelengths smaller than head dimensions) and ILD varying more significantly with
high-frequency content (where the wavelength is less than the dimensions of the
head). The application of a 2–3 kHz filter can be used to generally separate the
contributions of the pinnae in the HRIR [50]. One can observe that ITD varies little over
the simulated distance range, while becoming more vague and ambiguous near the
interaural axis. In contrast, the ILD varies with distance in the same interaural axis
range of 70°–110°.


Fig. 4.1 Isocontours for ITD (left) and ILD (right) as a function of azimuth (in degrees) and radial
distance (from 1.5 to 10 m) obtained via numerical simulation of the HRTF of a dummy head (not
shown to scale). ITD (3 kHz low-pass Head-Related Impulse Response, HRIR; Threshold, −30 dB
first onset method): 50 µs contours. ILD (3 kHz high-pass HRIR, RMS difference): 1 dB contours
(from [48])


¹ We use the term HRTF to indicate the set of filters, each representing a pair of transfer functions
from a point source in space at a given distance around a given head to the left and right ear,
normalised by the transfer function with the body absent. The plural, HRTFs, therefore, represents
a collection of more than one HRTF, typically for different heads or test conditions. The head-related
impulse response or HRIR is the time domain transform of the HRTF.
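
The two cue estimates described above (a relative-threshold first onset for the ITD, and an RMS level difference for the ILD) can be sketched in a few lines. This is a minimal sketch and not the implementation used for Fig. 4.1: it assumes the HRIRs have already been low-pass filtered (for the ITD) or high-pass filtered (for the ILD) at 3 kHz, and the function names, sign conventions, and toy impulse responses are hypothetical.

```python
import math


def first_onset(hrir, threshold_db=-30.0):
    """Index of the first sample reaching the threshold relative to the peak."""
    peak = max(abs(v) for v in hrir)
    if peak == 0.0:
        raise ValueError("all-zero HRIR")
    limit = peak * 10.0 ** (threshold_db / 20.0)
    for n, v in enumerate(hrir):
        if abs(v) >= limit:
            return n


def rms(x):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def itd_seconds(hrir_left_lp, hrir_right_lp, fs):
    """Onset difference in seconds; negative when the left ear leads (assumed convention)."""
    return (first_onset(hrir_left_lp) - first_onset(hrir_right_lp)) / fs


def ild_db(hrir_left_hp, hrir_right_hp):
    """RMS level difference in dB; positive when the left ear is louder (assumed convention)."""
    return 20.0 * math.log10(rms(hrir_left_hp) / rms(hrir_right_hp))


# Toy impulse responses for a source on the left: the left ear signal
# arrives six samples earlier and at twice the amplitude.
fs = 48000
left = [0.0] * 4 + [1.0, 0.5] + [0.0] * 26
right = [0.0] * 10 + [0.5, 0.25] + [0.0] * 20

itd = itd_seconds(left, right, fs)
ild = ild_db(left, right)
```

With these toy signals the onset difference is six samples at 48 kHz, and the level difference corresponds to a factor of two in amplitude. Real HRIRs would additionally require the filtering stage, which is omitted here.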

Other physical interactions between the sound wave and the torso, head, and pin- 
nae (the external parts of the ear) introduce a range of spectral cues (principally 
through series of peaks and notches) which can be used to judge whether a sound 
source is e.g. above or below, to the front or rear of the listener, while ITD and ILD 
remain relatively unchanged. Considering the various morphological regions of the
pinnae, as indicated later in Sect. 4.2.1 (Fig. 4.2a), each of these is potentially related
to specific characteristics of the HRTF filters. As such, individual morphological
variations will result in different HRTFs. When reproducing binaural audio, it has been
experimentally demonstrated that using an HRTF that does not match that of
the listener has a detrimental effect on the accuracy and realism of virtual sound
perception. For example, it has been noted that listeners are able to localise virtual
sounds that have been spatialised using their own HRTFs with a similar accuracy
to free field listening, though some studies have shown poorer elevation judgements 
and increased front-back confusions [67], which may be due to the idealised ane- 
choic nature of HRTFs and the importance of slight head movements and associated 
dynamic cues [37, 102]. These errors can significantly increase when using someone 
else’s HRTF [99]. Furthermore, using non-individual HRTFs (see Sect. 4.1.2) has 
been shown to affect various perceptual attributes when considering complex scenes, 
in addition to those associated with source localisation: i.e. Coloration, Externalisa- 
tion, Immersion, Realism and Relief/Depth [87]. In this chapter, the primary focus 
is on localisation as the perceptual evaluation metric. Chapter 5 introduces and dis- 
cusses other relevant metrics. 


4.1.2 Minimising HRTF Mismatch Between the System 
and the Listener 


Various means have been investigated to minimise erroneous or conflicting binaural
acoustic localisation cues relative to the natural cues delivered to the auditory
system and, as such, improve the quality of the resulting binaural rendering. The majority
of research has focused on improving the similarity between the rendering
systems’ localisation cues and those of the individual listener. This is generally termed
“individualisation” or “individualised” binaural rendering. To clarify questions of 
nomenclature, we propose the following terms: 


• individual to identify the HRTF of the user;

• individualised or personalised to indicate an HRTF modified or selected to best
accommodate the user;

• non-individual or non-individualised to indicate an HRTF that has not been tailored
to the user; and

• dummy head or so-called generic HRTF sets, which are specific instances of
non-individual HRTFs, often designed with the goal of representing a certain pool
of subjects.


While not exhaustive, a general overview of individualisation methods is discussed 
here. 


Binaural Recordings and Synthesis 


The first and most direct method to create an individual rendering is to perform 
the recording with binaural microphones placed in the ear canal of the listener. This 
is, however, in most cases an impractical solution. The second, still rather direct,
method is to measure the HRTF of an individual for a collection of spatial positions 
and to then use this individual HRTF to produce an individual binaural synthesis 
rendering through convolution of the sound source with the relevant incident
direction HRTF [14, 105]. While this is the most common method employed to date,
it is generally limited to those with the facilities and equipment to carry out such 
measurements [4]. 
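
The convolution step at the core of binaural synthesis can be sketched as follows. This is a minimal time-domain sketch for a single static source direction, assuming the left and right HRIRs for that direction are already available; the function names and toy signals are hypothetical, and practical renderers would normally use fast (FFT-based, partitioned) convolution with interpolation between directions.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def binaural_synthesis(mono, hrir_left, hrir_right):
    """Filter a mono source with the HRIR pair of one incident direction."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy example: the left ear response is a one-sample delay,
# the right ear response is an attenuation by one half.
mono = [1.0, 0.0, -1.0]
left_out, right_out = binaural_synthesis(mono, [0.0, 1.0], [0.5])
```

The two output sequences are the signals to be presented over headphones to the left and right ears respectively.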

The general pros and cons between binaural recordings and binaural synthe- 
sis merit mention. While individual binaural recordings provide arguably the most 
accurate 3D audio capture/reproduction method, they require the sonic environment 
and the individual to be situated accordingly. For any reasonable production, this 
would resemble a theatrical piece being performed around the individual in a first 
person context. The recording would capture the acoustic detail of the soundscape, 
including reflections from various surfaces, diffraction and scattering effects. How- 
ever, the head orientation of the individual would be encoded into the recording, 
imposed on the listener at playback. If presented to another individual, the issues of 
HRTF mismatch are introduced, degrading the spatial audio quality to an unknown 
degree for each individual. In laboratory conditions, this method suffers additional 
difficulty, as the individual takes part in the recording, making the presentation of 
unfamiliar material difficult. In contrast, binaural synthesis allows for the scripting, 
manipulation and mixing of 3D scenarios without the intended listener present. With 
real-time synthesis, head tracking can be incorporated allowing freedom of move- 
ment by the individual, a basic requirement for VR applications. HRTF mismatch is 
alleviated through the use of individual HRTFs. However, the quality of the produc- 
tion is affected by the level of detail in the acoustic simulation of the environment, 
including elements such as source and surface properties. Highly complex scenes 
and acoustic environments can require significant computational resources (the inter- 
ested reader can refer to Chap. 3 for further details on this topic). Spatial synthesis 
using HRTF data is also affected by the measurement conditions of the employed 
HRTF, predominantly the measurement distance. If sound sources are to be rendered 
at various distances, this requires either multiple HRTF datasets, or deformation of 
the individual HRTF data to approximate such changes in distance. Further discus- 
sion of these details is beyond the scope of this chapter. In continuing, the focus will 
be limited to questions concerning the individual nature of the HRTF as integrated 
into an auditory VR environment through binaural synthesis. 


Introduction to System-to-User and User-to-System Adaptation


A variety of alternative methods exist in order to improve the match between the 
HRTF used for the rendering and the specific HRTF of the listener. It is the aim of 
this chapter to present an overview of those approaches that have been evaluated 
and validated through experimental research. In order to map the various methods 
and at the same time simplify the narrative and facilitate the reading, the text has 
been organised in two separate sections. Section 4.2 presents research which looks 
at matching the rendering system to the specific listener (system-to-user adaptation), 
thus aiming to provide every individual with the best HRTF possible. Section 4.3 
looks at the problem from a diametrically opposite point of view, introducing studies 
where the listener is trained in order to adapt to the rendering system (user-to-system 
adaptation), therefore aiming at improving the performance of a specific individual
when using non-individual HRTFs. 

While a rather extensive number of studies exist on the topic of system-to-user 
adaptation, a more limited amount of research has been carried out focusing on user- 
to-system adaptation. For this reason, while Sect. 4.2 is presented as an extensive 
review of several research projects, Sect. 4.3, after an initial overview, then dives 
more in depth into one specific study carried out by this chapter’s authors, giving 
details of the methodology and briefly discussing the results. Section 4.5 concludes 
by presenting a brief overview of open challenges on this topic. 


4.2 System-to-User Adaptation: HRTF Synthesis and 
Selection 


Two main approaches exist for obtaining individual (or at least personalised) HRTFs 
without having to measure them acoustically. The first one focuses on numerical 
simulations, therefore using mathematical methods to generate an HRTF for a given 
individual from 3D models of the head, torso, and pinnae. Techniques such as the
Boundary Element Method (BEM), Finite Element Method (FEM), and Finite
Difference Time Domain (FDTD) method, which are commonly employed in diffraction,
scattering, and resonance problems, allow one to calculate the HRTF of a given
individual based on precise geometrical data (e.g. coming from a 3D scan of the head and
pinnae). They have been used for this purpose since the late 1990s, and have shown
increased uptake and success in recent years thanks to technological advancements
in domains such as high-performance computing and high-resolution 3D scanning.
An example of such a resulting 3D mesh from a Neumann KU-100 dummy head
can be seen in Fig. 4.2b. The second one relies on using HRTFs from available
datasets, either transforming them in order to provide a better fit for a given listener 
or selecting a best fit considering, for example, preference or performance, e.g. using 
a sound localisation task or signal metric. Due to the relative independence between 
the ITD and the spectral cues, the HRTF can be decomposed and different elements
addressed by different methods, e.g. an ITD structural model can be used with best
fit selected spectral cues [22, 78].

As can be expected, each of these approaches comes with specific challenges. 
Moreover, the success in employing one or the other depends significantly on factors 
such as the available data (quantity and quality), the time constraints in order to run 
the tests and the calculations, and the context for which the rendering is needed (i.e. 
the requirements in terms of quality, interactivity, etc.). An overview of the various 
techniques and related challenges, including solutions found through state-of-the-art
research studies, is presented in the following sections. 

Fig. 4.2 Pinna morphology nomenclature and example BEM mesh (from [91])


4.2.1 HRTF Modelling 


Various attempts have been made to investigate the function of the pinna, linking 
HRTFs to its morphology as well as that of the head and torso. Early work by Teran- 
ishi and Shaw [93] looked at creating a physical model of the pinnae and analysing 
the various excitation modes generated by a nearby point source. The model, based 
on very simple geometries, showed responses similar to those of real data, and rep- 
resented one of the first steps towards better understanding the spatially varying 
acoustic role of the pinna. Similar work was done by Batteau [12], who created a 
mathematical representation of the acoustical transformation performed by the pinna 
and produced the first mathematically described theory of sound source localisation 
based on a reflection-diffraction model. These studies were the baseline of research
carried out 30 or more years later, when the available computational power made it
possible to create more complex models, and to validate those by comparing them with
experimental measures (e.g. [58]). Further modelling work was carried out looking at
simplified models and approximations. Notable examples are those of Genuit [26] 
based on a structural simplification model of the pinnae, Algazi and colleagues [1] 
based on an approximation of the head and the torso using ellipsoidal and spherical 
models, and Spagnol and colleagues [89] looking at ray-tracing analysis of pinna 
reflection patterns. It is relevant to note that many of the early studies focused on 
models for understanding the various phenomena and principles involved, rather than 
models for binaural audio rendering. For these early studies, much of the research 
on spatial perception was carried out independently from acoustical/morphological 
studies regarding the details of the pinnae. 

One of the first experiments using these techniques applied to HRTFs (including
pinnae) was carried out by Katz [49, 51, 52]. This work focused on using BEM to 
calculate HRTFs by modifying various aspects of the geometrical models, for exam- 
ple, eliminating the pinna, changing the size and shape of the head, and accounting 
for hair acoustic impedance. Results from numerical simulations were then compared 
with experimental measures, validating the technique and improving our understand- 
ing of the role of the pinnae in modifying the incoming sound in a direction-dependent 
manner. Similar work was carried out in the same period by Kahana [44, 46]. Such 
simulations were initially limited, due to computational resources, to an upper fre- 
quency of 6 kHz, then extended to 10 and 20 kHz in later studies [32, 45]. Even 
in these cases the validation was performed comparing the numerical model results 
with experimental measurements showing a good match between the two, also in 
light of the variances observed between different HRTF measurement systems for 
the same individual [4, 47]. The computational complexity of these numerical meth- 
ods was a major limitation in the early years of using this technique for generating 
HRTFs. Various optimisation techniques have since been proposed [35, 55, 70], allowing
significantly faster computation times with reasonable processing resources (i.e. no 
longer needing super computers). This led to the development of easy-to-use and 
open-source tools for the numerical calculation of HRTFs. A notable example is 
mesh2hrtf [110], a software package centred on a BEM solver, as well as tools 
for the pre-processing of geometry data, generation of evaluation grids and post- 
processing of calculation results. It is essential here to consider a major challenge to 
be tackled when approaching HRTF synthesis from geometrical models, which is the 
acquisition and processing of the 3D models from which the HRTFs are computed. 
Evaluations of various 3D scanning methods, specifically looking at capturing the 
geometry of the pinnae, have been carried out [44, 69, 80]. 

Numerical simulations also brought significant benefits with regard to repeata- 
bility, replicability and reproducibility. A comparison of different numerical tools 
for simulating an HRTF from scan data by Greff and Katz [32] (here employing 
the high-resolution scan of a Neumann KU-100 shown in Fig. 4.2b) showed little 
variance. In contrast, a similar comparison of acoustical HRTF measurements using 
the same head at different laboratories [4] showed significant variations between 
resulting HRTFs. Another significant advantage of numerically modelling HRTFs 
rather than measuring them is that with physical measurements on human subjects 
it is difficult or impossible to isolate the influence of different morphological char- 
acteristics on the actual HRTF filters. 


Morphological Relationships 


Exploring and modelling the relationship between geometrical features and filter
characteristics is indeed a very important step for advancing our understanding of
the spatial hearing processes. Research in this area was strongly advanced with the
distribution of the CIPIC HRTF database [2], which included associated
morphological parameter data for most subjects. This effort was followed with the LISTEN
HRTF database [98], providing similar data. Benefiting from the power of
numerical simulation and controlled geometrical models, Katz and Stitt [91] investigated
the effect of morphological changes by varying specific morphological parameters
(an extension of the CIPIC set of morphological parameters intended to provide more
unique solutions). In order to do this, they created a Parametric Pinna Model (PPM) and
with BEM they investigated the sensitivity of the HRTF to specific morphological
alterations. Examples of pinnae created using this PPM can be seen in Fig. 4.3.
Evaluations included the use of auditory models [88] to identify those
morphological changes most likely to affect spatial hearing perception. In line with previous
studies, morphological features near the rear of the helix were found to have little
influence on HRTF objective metrics, while the dimension of the concha had a much
more relevant impact, looking at both the directional and diffuse HRTF spectral
components.² Other relevant findings include the importance of the region around the
triangular fossa, which is often not considered when looking at HRTF
personalisation, and the fact that the relief (or depth, directions parallel to the interaural axis)
parameters were found to be at least as important as side-facing parameters, which
are more frequently cited in morphological/HRTF studies.


Fig. 4.3 Two pinnae created with the parametric model developed in [91]

Such interest in binaural audio, combined with major advancements in terms of
available technologies, has encouraged the publication of large datasets of
BEM-generated HRTFs and corresponding high-accuracy 3D geometrical models. An
example is the Sydney York Morphological and Acoustic Recordings of Ears
(SYMARE) database [42], which was then followed by other examples of either
head-related or more reduced complexity pinnae-related datasets [18, 34]. The availability
of such large datasets opened the door to the use of machine learning approaches to
tackle the issue of morphology-based HRTF personalisation. An example is the work
by Grijalva and colleagues [33], where a non-linear dimensionality reduction
technique is used to decompose and reconstruct the HRTF for individualisation, focusing
on elements which vary the most between positions and across individuals. Results
may offer improved performance over linear methods, such as principal component
analysis (e.g. [81]).


² The diffuse field component is the spatial average of the HRTF. When removed from the HRTF,
the result is a diffuse field equalised directional transfer function (DTF) [64].
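
The separation of an HRTF into a diffuse field component and directional transfer functions can be illustrated with a short sketch. It is a sketch under stated assumptions rather than a reference procedure: the spatial average is taken here as an unweighted power average over directions (which presumes a roughly uniform measurement grid), only magnitudes for one ear are handled, and the function names and toy values are hypothetical.

```python
import math


def diffuse_field_db(hrtf_mags):
    """Power average over directions, per frequency bin, in dB.

    hrtf_mags[d][k] is the linear magnitude for direction d and bin k.
    """
    n_dirs = len(hrtf_mags)
    n_bins = len(hrtf_mags[0])
    diffuse = []
    for k in range(n_bins):
        mean_power = sum(hrtf_mags[d][k] ** 2 for d in range(n_dirs)) / n_dirs
        diffuse.append(10.0 * math.log10(mean_power))
    return diffuse


def directional_transfer_functions_db(hrtf_mags):
    """Subtract the diffuse field component (in dB) from every direction."""
    diffuse = diffuse_field_db(hrtf_mags)
    return [
        [20.0 * math.log10(m) - diffuse[k] for k, m in enumerate(row)]
        for row in hrtf_mags
    ]


# Toy magnitudes: two directions, two frequency bins.
mags = [[1.0, 4.0], [3.0, 2.0]]
dtf = directional_transfer_functions_db(mags)
```

By construction, the power average of the resulting DTFs over directions is 0 dB in every bin, so what remains is only the direction-dependent part of the filtering.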


HRTFs, Binaural Models and Perceptual Evaluations 


It is evident that since the 1990s a large amount of work has been carried out looking 
at synthesising HRTFs and better understanding the relationship between these and 
morphological features of the pinnae, head and torso. Nevertheless, it must be
reiterated that very few of the reviewed studies have included perceptual evaluations on the
modelled HRTFs [18, 56], and that in no case were such subject-based validations
extensive enough to fully support the use of synthesised HRTFs instead of measured
ones. It is therefore clear that significant research is still needed in order to develop 
and validate models that can describe, classify and ultimately generate individual 
HRTFs from a reduced set of parameters. 

While numerical assessments can be very useful when trying to better explain 
experimental results, they cannot be the only way to explore and validate the quality 
of the rendering choices. Binaural models (e.g. [88]) could become an invaluable 
tool to help overcome such limitations, as they offer a computational simulation of 
binaural auditory processing and, in certain cases, also allow listeners’
responses to binaural signals to be predicted. Using them, it is possible to rapidly perform
comprehensive evaluations that would be too time-consuming to implement as actual
auditory experiments (e.g. [17]). 

An example of this approach can be found in [29], where an anthropometry-based 
mismatch function between HRTF pairs, looking at the relationship between pinna 
geometry and localisation cues, was used to select an optimal HRTF for a given 
individual, specifically looking at vertical localisation. The outcome of the selection 
was then evaluated using an auditory model which computed a mapping between 
HRTF spectra and perceived spatial locations. While this study outlined that the best 
fitting HRTF selected with the proposed method was predicted to yield a significantly 
improved vertical localisation when compared to a selected generic HRTF, it must be 
reiterated that the reliability of perceptual models is still to be thoroughly validated, 
and potential biases can be identified and dealt with only through actual perceptual 
evaluations. Another similar application of binaural models has been recently pub- 
lished, focusing on the comparison between different Ambisonics-based binaural 
rendering methods [25]. The very large number of independent variables (e.g. each 
method was tested with Ambisonics orders from 1 to 44), as well as the complex- 
ity of the interactions between such variables, would make it very challenging to 
run perceptual evaluations with subjects. This study not only showed that the models’
predictions were consistent with previous perceptual data, but also contributed to
validating the models’ ability to predict user responses to binaural signals.
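
The general idea of anthropometry-based selection mentioned above, choosing from a database the HRTF whose owner has the most similar pinna geometry, can be sketched as a nearest-neighbour search. The sketch is purely illustrative: the scaled Euclidean distance used here is a stand-in and not the mismatch function of [29], and the parameter names, scales, and measurements are all hypothetical.

```python
import math


def mismatch(listener, candidate, scales):
    """Scaled Euclidean distance between two sets of pinna measurements.

    This metric is a hypothetical stand-in for a mismatch function.
    """
    return math.sqrt(
        sum(((listener[k] - candidate[k]) / scales[k]) ** 2 for k in scales)
    )


def select_hrtf(listener, database, scales):
    """Return the ID of the database subject with the smallest mismatch."""
    return min(
        database,
        key=lambda subject_id: mismatch(listener, database[subject_id], scales),
    )


# Hypothetical measurements in millimetres.
scales = {"concha_height_mm": 2.0, "pinna_height_mm": 5.0}
database = {
    "subject_A": {"concha_height_mm": 18.0, "pinna_height_mm": 62.0},
    "subject_B": {"concha_height_mm": 22.0, "pinna_height_mm": 70.0},
}
listener = {"concha_height_mm": 19.0, "pinna_height_mm": 64.0}

best = select_hrtf(listener, database, scales)
```

As the surrounding discussion stresses, a selection made in this way would still need to be checked with an auditory model and, ultimately, with perceptual evaluations.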

It is likely that models will never be able to provide 100% accurate assessments
near the zone of perfect reproduction, in part due to the difficulties in modelling
processes such as cognitive loading and procedural/perceptual learning. However, 
it is reasonable to expect them to provide broadly correct predictions for larger 
errors. This means that they could be particularly useful when prototyping rendering 
algorithms and designing HRTF personalisation experiments, in order to rapidly 
reduce the number of conditions and variables which are subsequently assessed 
through real subject-based perceptual evaluations. 

Artificial intelligence and machine learning should play an important role in such 
future research, looking at improving both HRTF synthesis and selection processes, 
as well as the accuracy and reliability of perceptual models. 


4.2.2 HRTF Selection 


A different approach for obtaining individual (or at least personalised) HRTFs with- 
out having to acoustically measure them is to rely on available HRTF databases, either 
transforming/tuning the transfer function according to certain subjective criteria, or 
designing a process for selecting the best fitting HRTF for a given subject. Regarding 
the first option, as mentioned at the beginning of this section, it is generally known that 
frequency-independent ITDs from a given HRTF can be modified and personalised 
according to e.g. the head circumference of a given listener [9]. Such a technique is 
implemented in a few binaural spatialisers [22, 78]. However, the personalisation of 
other HRTF features, such as monaural and interaural Spectral Cues, presents more 
significant challenges. Early works in this direction looked at improving vertical 
localisation by scaling the HRTF in frequency [64, 65]. Other “simpler” approaches 
to tuning were found to be effective, for example, by manually modifying frequency 
and phase for every HRTF direction, for the left and right ears independently [86]. 
Hwang and colleagues [40] carried out a principal component analysis on the CIPIC 
HRTFs and used the output components to develop a customisation method based 
on subjective tuning of a generalised HRTF. Such customisation allowed listeners 
to perform significantly better in vertical perception and front-back discrimination 
tasks. The same approach was used to modify and personalise a KEMAR HRTF, 
resulting also in this case in significantly improved vertical localisation abilities [84]. 
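
The ITD personalisation mentioned above can be illustrated with a spherical-head approximation. The following is a minimal sketch, assuming the classic Woodworth formula and a head radius derived directly from the head circumference; it is not necessarily the model used in [9] or in the spatialisers cited above, and the function names and the speed of sound value are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def woodworth_itd(azimuth_deg: float, head_circumference_m: float) -> float:
    """Frequency-independent ITD (seconds) for a rigid spherical head.

    azimuth_deg is the source angle from straight ahead, in [-90, 90].
    The head radius is assumed to be circumference / (2 * pi).
    """
    radius = head_circumference_m / (2.0 * math.pi)
    theta = math.radians(azimuth_deg)
    return (radius / SPEED_OF_SOUND) * (theta + math.sin(theta))


def itd_scale_factor(listener_circ_m: float, reference_circ_m: float) -> float:
    """Factor by which the ITDs of a reference HRTF would be stretched
    to suit a listener, under the spherical-head assumption."""
    return listener_circ_m / reference_circ_m
```

Under these assumptions, a head circumference of 0.57 m gives a maximum ITD of roughly 0.68 ms for a source at 90° azimuth, and the ITDs of a generic HRTF scale linearly with the listener's head circumference.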


HRTF Selection Methods 


Methods for selecting a best fit HRTF based on subjective criteria can be grouped 
into two general categories: physical measurement-based matching and perceptual 
selection. The first pertains to selecting an HRTF from an existing set based on mor- 
phological measurements or sparse acoustical measurements. Of importance is the 
determination of the relevant morphological features, as they pertain to spatial hearing 
and HRTF-related cues, as examined by [91]. Zotkin and colleagues [112] looked at 
a selection strategy based on matching certain anthropometric pinnae parameters of 
the specific subject with those of HRTFs within a dataset, while providing associated 
low-frequency information using a “head-and-torso” model. Comparison between 
a non-personalised HRTF and the selected HRTF via this method showed height- 
ened localisation accuracy and improved subjective perception of the virtual auditory 
scene when using the latter. A similar approach was used by [81], where advanced 
statistical methods were employed to create a subset of morphological parameters, 
which were then employed for predicting what might be the subject’s preferred HRTF 
based on measurement matching. HRTFs selected using this method performed bet- 
ter than randomly selected ones. An alternate selection perspective was proposed 
in [30], where a reflection model was applied to the picture of the pinnae of the 
subject, facilitating the extraction of relevant anthropometric parameters which were 
then used for selecting one or more HRTFs from an existing database. This selection 
method resulted in a significant improvement in elevation localisation performances, 
as well as an enhancement of the perceived externalisation of the simulated sources. 
The relationship between features of the pinna shape and HRTF notches, focusing 
specifically on elevation perception, was successfully used in [27] for selecting a best 
fitting HRTF from pinna images. Interestingly, studies on Spectral Cues have sug- 
gested the importance of notches over peaks in the HRTF [31]. Another work from 
Geronazzo and colleagues [28] introduced a rather original approach by developing 
the Mixed Structural Modelling (MSM), a framework for HRTF individualisation 
which combines structural modelling and HRTF selection. The level of flexibility 
of this solution, which allows mixing modelled and recorded components (therefore 
HRTF selection and synthesis), is particularly promising when looking at the HRTF 
personalisation process. 
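
Measurement-based matching of the kind described above can be sketched as a nearest-neighbour search in a space of anthropometric parameters. The example below is only an illustration under stated assumptions: the parameter names, the z-score normalisation and the Euclidean distance are hypothetical choices, not the procedures used in [112], [81] or [30].

```python
import math
from statistics import pstdev


def select_by_anthropometry(listener, database):
    """Return the id of the database subject closest to `listener`.

    listener: dict of parameter name -> value (e.g. in cm)
    database: dict of subject id -> dict with the same parameter names
    """
    names = sorted(listener)
    # Scale each parameter by its spread across the database so that no
    # single dimension dominates the distance because of its units.
    spread = {}
    for name in names:
        values = [entry[name] for entry in database.values()]
        spread[name] = pstdev(values) or 1.0

    def distance(entry):
        return math.sqrt(sum(
            ((entry[name] - listener[name]) / spread[name]) ** 2
            for name in names))

    return min(database, key=lambda sid: distance(database[sid]))
```

For instance, with three hypothetical database subjects and a listener whose pinna height and cavum width lie closest to those of the second subject, the function returns that subject's identifier.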


HRTF Evaluation 


It must be highlighted that whether selection is based on measured or perceptual data, 
the evaluation of said method is necessarily perceptual as the final application is a 
human-centred experience. With this in mind, a fundamental yet unanswered ques- 
tion is: “What determines the suitability of an HRTF for a given subject?” [48]. When 
establishing whether an HRTF is a good fit, should one look at how precisely sound 
sources can be localised using that HRTF (direct approaches), or should other sub- 
jective metrics (e.g. realism, spatial quality or overall preference) be employed [87]? 
In employing perceptual selection, the choice of protocol becomes more critical. 
In addition, as was observed with acoustical measurements, the repeatability of the 
measurement apparatus (here the response of human subjects) must be examined and 
taken into account. As an example, past studies using binaural audio rendering for 
applications other than spatial hearing research (e.g. [74]) relied on simple percep- 
tually based HRTF selection procedures which, at a later stage, resulted in being less 
repeatable than originally thought [6]. Without extensive training as seen in some of 
the principal earlier studies, the reliability of naive listeners (a situation that is 
also more representative of applied uses of binaural audio than studies on 
fundamental auditory processing) must be taken into account. Early studies on HRTF 
selection through ratings [53, 74] assumed innate reliability in quality judgements. 
More recently, studies have shown that such reliability cannot be assumed, but must 
be evaluated, with some listeners being highly repeatable while others are not [6]. 

It can be assumed that different HRTFs will, for a given subject, result in different 
performances in a sound source localisation task. From this we can infer that an 
optimal HRTF could be selected looking at such performances, for example, using 
metrics such as localisation errors and front-back and up-down confusion rates (see 
Sect. 4.3.2 for metric definitions). This assumption has been the baseline of several 
studies where an HRTF selection procedure was designed and evaluated based on 
localisation performances [41, 83, 96]. Such methods previously required specialised 
hardware, though current consumer Virtual Reality (VR) devices, thanks to their 
increasingly higher performance in terms of tracking capabilities (e.g. [43]), can now 
be employed for rendering and reporting the perceived direction of the sound source. 
However, these methods still remain rather time-consuming, as a large number of 
positions across the whole sphere should be evaluated in order to obtain reliable 
results. 
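
A direct selection of this kind can be sketched as follows. The example assumes that each trial is stored as a pair of (azimuth, elevation) directions in degrees and that the mean great-circle error is the only selection criterion; real procedures typically also consider confusion rates, and the function names are illustrative.

```python
import math


def to_unit_vector(azimuth_deg, elevation_deg):
    """Convert an (azimuth, elevation) direction to a Cartesian unit vector."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def great_circle_error(target, response):
    """Angle in degrees between two (azimuth, elevation) directions."""
    a, b = to_unit_vector(*target), to_unit_vector(*response)
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def best_hrtf_by_localisation(trials):
    """trials: dict HRTF id -> list of (target, response) pairs.

    Returns the HRTF id with the smallest mean great-circle error.
    """
    def mean_error(pairs):
        return sum(great_circle_error(t, r) for t, r in pairs) / len(pairs)

    return min(trials, key=lambda hrtf: mean_error(trials[hrtf]))
```

Since reliable estimates require many target positions across the sphere, the length of each list in `trials` largely determines how long such a procedure takes.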

Alternatively, HRTF selection can be the result of subjective evaluations based 
on indirect quality judgement approaches. Several research works have looked at 
asking listeners to rate HRTFs based on the perceived quality of some descriptive 
attributes, from the overall impression [106] to how well the auditory presentation 
matched specifically described locations or movements of the virtual source [53, 83, 
85] (e.g. Fig. 4.4). Several methods have been introduced for ultimately being able 
to select one or more best performing HRTFs; these include ranking [83], rating on 
scales [6, 53, 82], multiple selection-elimination rounds [97] and pairwise compar- 
isons [85, 106]. In general, there seems to be an agreement on the fact that expert 
assessors (as defined by [107]) perform significantly better (i.e. in a more reliable 
and repeatable manner) than initiated assessors [6, 54]. To gain further 
insight into indirect method results, some work has been carried out to develop global 
perceptual distance metrics with the aim to describe both HRTF and listener simi- 
larities [8]. In addition to proposing and evaluating a set of perceptual metrics, this 
work encourages further research into novel experiment design which could help in 
minimising the need for data normalisation and, more importantly, outlines the need 
for further investigations on the stability of these perceptual experiments/evaluations, 
specifically looking at repeatability and training. 

Few studies have examined the similarity between direct (i.e. localisation 
performances) and indirect HRTF selection methods. Using an immersive VR reporting 
system for the localisation test, results from [108] indicated a significant and pos- 
itive mean correlation between HRTF selection based on localisation performance 
and HRTF ranking/selection based on quality judgement; the best HRTF selected 
according to one method had significantly better rating according to metrics in the 
other method. In contrast, using a gestalt reporting method through the use of an 
avatar representation of the listener’s head, results from [54] showed no signifi- 
cant correlations. A number of protocol differences exist between these two studies, 
including the type of tasks used for both methods, the user interface (see [10, 11] 
regarding localisation reporting method effects), the stimuli signals, as well as the 
metrics evaluated in the quality judgement task. 
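
One simple way to quantify the agreement between a direct and an indirect method is a rank correlation between per-HRTF localisation errors and per-HRTF quality ratings. The sketch below uses Spearman's coefficient in its tie-free form; this is an illustrative assumption and not necessarily the statistic used in [108] or [54], and the data in the usage note are invented.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for two equally long, tie-free lists."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Because a lower localisation error and a higher rating both indicate a better HRTF, perfect agreement between the two methods appears as a coefficient of -1: hypothetical errors of 22.0°, 35.5°, 28.1° and 41.0° paired with ratings of 7.5, 4.0, 6.1 and 3.2 give exactly that value.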


4.3 User-to-System Adaptation: HRTF Accommodation 


The previous section examined HRTF selection and individualisation methods in 
the signal domain. While such methods aim to provide every individual user with 
the best HRTF possible, such approaches are not always available in all conditions. 
However, evidence is increasingly available showing that the adult brain is adaptable 
to environmental changes. It has been demonstrated that this adaptability (or plas- 
ticity) regarding spatial auditory processing can lead to a reduction in localisation 
error over time in the case when a listener’s normal localisation cues are significantly 
modified. 

It has been established that one can adapt to modified HRTFs over time, with 
ear moulds inserted in the pinnae [19, 38, 94, 95], or with non-individual HRTFs 
through binaural rendering [73, 77, 90, 92, 99, 109]. Studies have shown that one can 
adapt to distorted HRTFs, e.g. in [60] where participants suffering from hearing loss 
learned to use HRTFs whose spectrum had been warped to move audio cues back into 
frequency bands they could perceive. HRTF learning is not only possible, but lasting 
in time [62, 92, 109]: users have been shown to retain performance improvements up 
to 4 months after training [109]. Given enough time, participants using non-individual 
HRTFs may achieve localisation performance on par with participants using their 
own individual HRTFs [73, 77, 92]. 

This concept has been successfully used to improve user localisation performance 
within virtual auditory environments when using non-individual HRTFs. Readers are 
referred to [61, 104] for more general reviews on the broader topic of HRTF learning. 


4.3.1 Training Protocol Parameters 


Learning methods explored in previous studies are often based on a localisation 
task. This type of learning is referred to as explicit learning [61], as opposed to 
implicit learning where the training task does not immediately focus participant 
attention on localisation cues [73, 92]. Performance-wise, there is no evidence to 
suggest either type is better than the other. Implicit learning gives more leeway for 
gamifying the task design. Gamification is increasingly applied to the design of 
HRTF learning methods [39, 73, 90, 92], and while its impact on HRTF learning 
rates remains uncertain [90], its benefit for learning in general is well 
established [36]. On the other hand, explicit learning more readily produces training 
protocols where participants are consciously focusing on the learning process [63], 
potentially helping with the unconscious re-adjustment of auditory spatial mapping. 

As much as the nature of the task, providing feedback can play an important 
role during learning. VR technologies are increasingly relied upon to increase 
feedback density in the hope of increasing HRTF learning rates (in Chap. 10, the 
interested reader can find further insights on multisensory feedback in VR). While 
results encourage the use of a visual virtual environment [60], it has been reported 
that proprioceptive feedback alone can be used to improve learning rates [16, 73]. 
Direct comparison of experimental results suggests that active learning with direct 
feedback is more efficient (i.e. leads to faster improvement) than passive learning 
from sound exposure [61]. There is also a growing consensus on the use of adaptive 
(i.e. head-tracked) binaural rendering during training to improve learning rates [19], 
despite the generalised use of static head-locked localisation tasks to assess perfor- 
mance evolution [61]. It is not trivial to ascertain whether the benefit of head-tracked 
rendering comes from continuous situated feedback improving audio cue recalibra- 
tion, or from unbalanced comparison, as static head-locked rendering creates user 
frustration and results in less sound exposure [90]. 

Studies on the training stimulus indicate that learning extends to more than the 
signals used during learning [39, 90]. This result is likely dependent on specific char- 
acteristics of the stimuli and how these relate to auditory localisation mechanisms, 
i.e. whether they present the transient energy and broad frequency content necessary 
for auditory spatial discrimination [24, 57, 72]. 

There is no clear-cut result on optimum training session duration and scheduling. 
Training session duration reported in previous studies ranges from ~8 min [66] to 
~2 h [60]. Comparative analysis argues in favour of several short training sessions 
over long ones [61]. Training session spread also varies widely in the literature, 
ranging from all sessions in one day [57] to one every week or every other 
week [92]. Whereas results in [57] suggest that spreading training over time benefits 
learning (all in 1 day versus spread over 7 days), outcomes from [73, 92] indicate that 
weekly sessions and daily sessions result in the same overall performance improvement 
(for equal total training duration). There are some examples of latent learning 
(improvement between sessions) in the literature [66], naturally encouraging the spread 
of training sessions. Regardless of duration and spread, studies have shown that learning 
saturation occurs after a while. In [59], most of the training effect took place within 
the first 400 trials (~160 min), a result comparable to that reported by [20] where 
saturation was reached after 252 to 324 trials. 

One of the critical questions not fully answered to date is the role of the HRTF 
fit in the training process or how similar the training HRTF is to the actual HRTF of 
the individual. It would appear that a certain degree of affinity between a participant 
and the training HRTF facilitates learning [73, 92]. In contrast, lack of adaptation 
can occur if the HRTF to be learned is too different from one’s own HRTF. This 
is evidenced by mixed adaptation results in studies where ill-suited HRTF matches 
were tested. 


4.3.2 HRTF Accommodation Example 


We present here, as an example, the HRTF learning study by Stitt et al. [92], which 
examined the effect of adaptation to non-individual HRTFs. This study was cho- 
sen for this example as it provides a controlled study over a significant number of 
training sessions. As a “worst-case” real-world scenario, perceptually worst-rated 
non-individual HRTFs were chosen by each subject to allow for maximum poten- 
tial for improvement, another factor of interest in its design. This study is part of 
a series of studies on the subject of user-to-system adaptation, providing continuity 
of comparisons [15, 73, 77]. The methodology consisted of a training game and a 
localisation test to evaluate performance carried out over 10 sessions. Subjects using 
non-individual HRTFs (group W10) were tested alongside control subjects using 
their own individual measured HRTFs (group C10). 

Prior to any training, subjects were assigned non-individual HRTFs based on 
quality judgements of rendered sound object trajectories for 7 HRTF sets, taken as 
“perceptually orthogonal” [53]. These trajectories, shown in Fig. 4.4, were presented 
to subjects as a reference. Following the results of [8], which examined the reliability 
and repeatability of HRTF judgements by naive and experienced subjects, this rating 
task was performed three times, leading to a total of six ratings per subject, counting 
the two trajectories, with the overall judgement rating taken as the overall mean. 
The lowest rated HRTF for each subject was then used as that subject’s worst-match 
HRTF. This method is an improvement over alternate methods which are either 
uncontrolled (e.g. a single HRTF used by all listeners) or limited in the extent of 
relative spectral changes presented to subjects when compared to their individual 
HRTFs. 
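
The worst-match assignment described above amounts to pooling each subject's ratings over repetitions and trajectories and taking the HRTF with the lowest mean. A minimal sketch follows, assuming a numeric rating scale and hypothetical HRTF identifiers; the spread returned alongside the mean is an added convenience for inspecting repeatability, not part of the procedure in [92].

```python
from statistics import mean, pstdev


def summarise_ratings(ratings):
    """ratings: dict HRTF id -> list of ratings pooled over repetitions
    and trajectories. Returns dict id -> (mean rating, spread)."""
    return {hrtf: (mean(values), pstdev(values))
            for hrtf, values in ratings.items()}


def worst_match_hrtf(ratings):
    """Return the HRTF id with the lowest mean rating."""
    summary = summarise_ratings(ratings)
    return min(summary, key=lambda hrtf: summary[hrtf][0])
```

With six ratings per HRTF (three repetitions of two trajectories), a large spread for a given HRTF would flag a listener whose judgements are not very repeatable.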

The training procedure for the 10 sessions was devised as a simple game with a 
searching task in which the listener had to find a target at a hidden position in some 
direction (θ, φ), ignoring radial distance. Subjects searched for the hidden target by 
moving the motion-tracked hand-held object around their head (see concept in Fig. 
4.5). For the duration of the search, alternating pink/white noise (50–20000 Hz) with 
an overall level of approximately 55 dBA measured at the ear was presented to the 
listener, positioned at the location of the tracked hand-held object relative to the 
subject’s head. This provided a link between the proprioceptively known position of 
the subject’s own hand and spatial cues in the binaural rendering. The alternation 
rate of the pink/white noise bursts increased with increasing angular proximity to the 
target direction using a Geiger counter metaphor [71, 79]. Once the subject reached 
the intended target direction, a success sound would play, spatialised at the target’s 
location. The training game lasted 12 min and subjects were instructed to find as many 
targets as possible in the time available. Sessions 1–4 occurred at 1-week intervals, 
while the remaining sessions occurred at 2-week intervals. 

Fig. 4.5 Training game concept design 

It should be emphasised that no auditory localisation on the part of the subject 
was actually required to accomplish this task, only tempo judgements of the alter- 
nation rate of the pink/white noise bursts and proprioceptive knowledge of one’s 
hand position. HRTF adaptation was therefore an implicit result of game play, but 
not the task of the game as far as the participant was aware. This task was designed 
to facilitate learning with source positions outside of the visual field of view, as well 
as to function for individuals with visual impairments. 
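
The Geiger counter metaphor used in the training game can be sketched as a mapping from the angular distance between the hand-held object and the target to the alternation rate of the noise bursts. The linear mapping and the rate limits below are hypothetical choices for illustration; the source does not specify the actual mapping used in [92].

```python
def alternation_rate_hz(angle_to_target_deg, slow_hz=1.0, fast_hz=10.0):
    """Map angular distance to target (0-180 degrees) to a burst
    alternation rate: the closer to the target, the faster the rate.

    slow_hz and fast_hz are assumed limits, not values from the study.
    """
    angle = min(max(angle_to_target_deg, 0.0), 180.0)
    proximity = 1.0 - angle / 180.0  # 0 when opposite, 1 when on target
    return slow_hz + proximity * (fast_hz - slow_hz)
```

The player only needs to follow the tempo, which is why the task requires no explicit auditory localisation.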


Performance Evaluation Metrics 


The HRTF accommodation was evaluated via localisation tests. Subjects were pre- 
sented a brief burst of noise (to limit the influence of any possible head movement 
during playback) and would subsequently point in the perceived direction of the 
sound using the hand-held object. No feedback was given to subjects regarding the 
target position. The noise burst consisted of a train of three 40 ms Gaussian broadband 
noise pulses (20000 Hz) with a 2 ms raised cosine window applied at onset and 
offset and 30 ms of silence between each burst [73]. There were 25 target directions 
with 5 repetitions of each target, with the tested sphere covering a full 360° 
of azimuth and -40° to 90° of elevation. 
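
The timing structure of such a test stimulus can be sketched as below. The sample rate and random seed are assumptions, and band-limiting of the noise is deliberately omitted, so this illustrates only the pulse, ramp and gap durations given above rather than the exact stimulus of [73].

```python
import math
import random


def raised_cosine_pulse(n_samples, ramp_samples, rng):
    """Gaussian noise pulse with raised cosine onset and offset ramps."""
    pulse = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    for i in range(ramp_samples):
        gain = 0.5 * (1.0 - math.cos(math.pi * i / ramp_samples))
        pulse[i] *= gain        # fade in
        pulse[-1 - i] *= gain   # fade out
    return pulse


def burst_train(sample_rate=48000, n_pulses=3, pulse_ms=40,
                gap_ms=30, ramp_ms=2, seed=0):
    """Train of noise pulses separated by silence, as a list of samples."""
    rng = random.Random(seed)
    n_pulse = sample_rate * pulse_ms // 1000
    n_gap = sample_rate * gap_ms // 1000
    n_ramp = sample_rate * ramp_ms // 1000
    signal = []
    for k in range(n_pulses):
        signal += raised_cosine_pulse(n_pulse, n_ramp, rng)
        if k < n_pulses - 1:
            signal += [0.0] * n_gap
    return signal
```

With the default values, the whole train lasts 180 ms (three 40 ms pulses and two 30 ms gaps), short enough to limit the influence of head movements during playback.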

Fig. 4.6 Interaural polar coordinate system and associated polar angle cone-of-confusion zone 
definitions. (a) Interaural coordinate system. Interaural lateral angle α defined in [-90:90], 
polar angle β [...] compared to original definition [67]. Listeners facing X with their left ear 
pointing towards Y. (b) Definition of the 4 different cone-of-confusion response classification 
zones: [...] Combined (from [108]). 

Two types of metrics were used to analyse localisation errors: angular and confusion 
metrics. The interaural coordinate system defines a lateral and a polar angle (Fig. 
4.6a). The lateral angle is the angle between the interaural axis and the line between the 
origin and the source. The lateral angle approaches cones-of-confusion along which 
the interaural cues (ITD and ILD) are approximately equal. A cone-of-confusion is 
defined by the contour around the listener for a given ITD or ILD (see Fig. 4.1). For 
ITD, these contours can be generally represented by a hyperbolic function, where 
the difference in arrival time to the two ears is constant and the vertex is on the 
interaural axis, between the two ears. The intersection of the ITD and ILD cones-of- 
confusion for a given stimulus prescribes a closed curve (approaching a circle). The 
ITD and ILD are insufficient to resolve the localisation ambiguity, requiring further 
information, such as from Spectral Cues or head movements. The polar angle is the 
angle between the horizontal plane and a perpendicular line from the interaural axis 
to the point, such that the polar angle prescribes the source location on the cone- 
of-confusion. The polar angle is primarily linked with the monaural, Spectral Cues 
in the HRTF. This independence of binaural and Spectral Cues makes the interaural 
coordinate system a natural choice when looking at localisation performance. If the 
perceived ILD, ITD and Spectral Cues of a given source do not adequately coincide 
with the expectations of the auditory system for a single point in space, uncertainty in 
localisation response ensues. The most commonly referenced uncertainties are polar 
angle confusions. 
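
The conversion from Cartesian coordinates to the interaural polar system can be sketched as follows. The axis convention (x forward, y towards the left ear, z up) follows the description of Fig. 4.6; measuring the lateral angle from the median plane, so that it falls in [-90, 90], and wrapping the polar angle into [-90, 270) are assumptions made for this illustration.

```python
import math


def interaural_polar(x, y, z):
    """Cartesian -> (lateral, polar) angles in degrees.

    Assumes x points forward, y towards the left ear and z upwards.
    lateral is in [-90, 90] (positive to the left, by assumption);
    polar is in [-90, 270): 0 front, 90 above, 180 behind.
    """
    r = math.sqrt(x * x + y * y + z * z)
    lateral = math.degrees(math.asin(y / r))
    polar = math.degrees(math.atan2(z, x))
    if polar < -90.0:
        polar += 360.0
    return lateral, polar
```

In this representation, all points sharing a lateral angle lie on one cone-of-confusion, and the polar angle alone indexes the position around that cone, which is what makes the system convenient for separating binaural and spectral cue errors.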

Fig. 4.7 Result analysis by subgroup. a Mean absolute polar angle error and 95% confidence 
intervals for groups W10+, W10- and C10 across sessions 1–10. b Response classification analysis: 
Mean classification of results for group W10 by type (precision (×), front-back error (○), up-down 
error (▽) and combined error (△)) for groups W10+ (solid line, 3 subjects) and W10- (dashed line, 
5 subjects) over sessions 1–10 (from [92]) 

Polar angle confusions are classified using a traditional segmentation of the 
cone-of-confusion [73, 92], revised in [108]. The classification results in three potential 
confusion types, front-back, up-down and combined, with a fourth type corresponding 
to precision errors, represented schematically in Fig. 4.6b. The precision category 
designates any response close enough to the real target so as not to be associated with 
the other confusion types. In short, responses classified under precision are those 
within ±45° of the target angle, front-back classified errors are responses reflected in 
the frontal plane, and those classified up-down are responses reflected in the transverse 
plane. Any responses that fall outside of these regions are classified as combined type 
errors. 
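
A simplified version of this classification can be sketched as follows. It assumes polar angles in degrees with 0° in front, 90° above and 180° behind, applies the same tolerance around the target and around its mirror images, and checks the zones in a fixed order; the exact zone boundaries defined in [108] may differ from this illustration.

```python
def wrap180(angle_deg):
    """Wrap an angle difference to the interval (-180, 180]."""
    wrapped = (angle_deg + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped


def classify_polar_response(target_deg, response_deg, tolerance_deg=45.0):
    """Classify a polar angle response relative to the target."""
    def near(reference_deg):
        return abs(wrap180(response_deg - reference_deg)) <= tolerance_deg

    if near(target_deg):
        return "precision"
    if near(180.0 - target_deg):  # target mirrored about the frontal plane
        return "front-back"
    if near(-target_deg):         # target mirrored about the transverse plane
        return "up-down"
    return "combined"
```

For a target at a polar angle of 30°, responses at 40°, 150°, -30° and -120° would be classified as precision, front-back, up-down and combined, respectively.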


Performance Evaluation Results 


Results examined the evolution of polar angle error and confusion rates. As a measure 
of accommodation, the rate of improvement was defined as the gradient of the linear 
regression of polar angle error. The rates of improvement for the 8 subjects spanned 
values of 0.5° to 4.6°/session over sessions 5–10 (as results for initial sessions have 
been shown to be influenced by procedural learning effects [59]). In contrast, results 
for the control group over the same sessions spanned 0° to 2.2°/session. A clustering 
analysis of the test group relative to the control group, C10, separated those whose 
rate of improvement exceeded that of the control group (subgroup W10+) and the 
remaining subjects (W10-) who did not. This second group failed to exhibit clear 
HRTF adaptation results over and above that of the control group whose improvement 
can be considered primarily as procedural learning. 
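
The rate of improvement defined above can be computed with an ordinary least-squares fit of polar angle error against session number. In the sketch below, the sign is flipped so that a decreasing error yields a positive rate; this sign convention and the example data in the usage note are assumptions.

```python
def improvement_rate(sessions, errors_deg):
    """Negative least-squares slope of error against session number,
    in degrees per session (positive when the error decreases)."""
    n = len(sessions)
    mean_s = sum(sessions) / n
    mean_e = sum(errors_deg) / n
    sxx = sum((s - mean_s) ** 2 for s in sessions)
    sxy = sum((s - mean_s) * (e - mean_e)
              for s, e in zip(sessions, errors_deg))
    return -sxy / sxx
```

For example, hypothetical errors of 40°, 38°, 36°, 34°, 32° and 30° over sessions 5 to 10 give a rate of 2°/session, which would sit within the range reported for the test group.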

The polar errors are shown in Fig. 4.7a for groups W10+, W10- and C10. Group 
W10+ approached a similar level of absolute performance to C10. This demonstrates 
that these subjects were able to adapt well to their worst-rated HRTF to a level 
approaching subjects using their individually measured one. It also shows clearly that, 
despite continuous training, some subjects (W10-) exhibited little or no improvement 
beyond the procedural learning seen in C10. 

The response classification results for groups W10+ and W10- are shown in Fig. 
4.7b. At the outset of the study, it can be observed that up-down and front-back 
type error rates are comparable between the two subgroups, with W10- exhibiting 
more combined type errors. This metric could be a potential indicator for identifying 
poor HRTF adaptation conditions. Subsequently, it can be clearly seen that group 
W10+ exhibits a steady increase in precision classified responses, with reductions in 
front-back errors over sessions 3–5 and subsequent reductions in combined errors. 
In contrast, group W10- exhibits generally consistent response classifications across 
sessions, with only small increases in precision classification mirrored by a decreas- 
ing trend in front-back errors. For all subjects, it can be noted that the occurrence of 
up-down errors is quite rare. 

Results of this accommodation study show that adaptation to an individual’s per- 
ceptually worst-rated HRTF can continue as long as training is provided, though the 
rate of improvement decreases after a certain amount of training. A subgroup achieved 
localisation performance levels approaching those of the control group with individual 
HRTFs. These performance levels were comparable to those observed in [73] with 
an identical test protocol, where subjects performed only three training sessions using 
their perceptually best rated HRTF. 


4.4 Discussion 


It is clear that, while various methods and tools are available for selecting a best fit 
HRTF for a given listener, there is no established evaluation protocol to determine 
how well these methods work and compare with each other. While some work is 
advancing in proposing common methodologies and metrics [75], the lack of estab- 
lished methods raises some very relevant questions about the feasibility of a unique 
HRTF selection task which performs reliably and independently from factors such 
as the listener’s expertise, the signals employed, the user interface, the context where 
the tests are carried out and, more in general, the task for which the final quality 
is judged. It seems evident that any major leap forward in this field is limited until 
two primary issues are addressed: (1) the establishment of pertinent metrics to per- 
ceptually assess HRTFs and (2) the relationship between these metrics and specific 
characteristics of the signal domain HRTF filters. 

The use of HRTF adaptation, in examining the results of this and previous studies, 
has been shown to be a viable option to improve spatial audio rendering, at least 
with regard to localisation. The level of adaptation achievable is related to the initial 
suitability (perceptual similarity) between the system HRTF and the user’s individual 
HRTF, with more suitable HRTFs showing more rapid adaptation. No significant 
effect has been found regarding the specific training intervals, though spreading out 
sessions is better than multiple sessions on the same day. The adaptation method 
could be integrated into a stand-alone game application, or as part of device setup 
and personalisation configurations, typical of most VR devices to some degree. The 
major limitation, once the training HRTF is chosen, is the need for repeated training 
sessions, and this must be made clear to users so that they do not expect ideal results 
from the start. 

Fig. 4.8 Example active HRTF learning training game. Training setup: (top-left) participant in 
the experiment room, (bottom-left) third person view of the training platform, (right) participant 
viewpoint during the training (from [77]) 

The combination of user-to-system and system-to-user adaptation is a promis- 
ing solution. While user-to-system adaptation appears limited by the initial training 
HRTF employed, system-to-user adaptation methods provide various means of pro- 
viding, if not a perfect individual HRTF, a reasonable near approximation. As such, 
selection of a reasonably good HRTF match followed by user training could be a viable 
real-world solution. 

An example of such a tailored HRTF training has been tested in [77]. In this work, 
as compared to the previously mentioned study in Sect. 4.3.2, the subject was aware of 
the goal of the training, with specific HRTF-based localisation difficulties presented 
with increasing difficulty (see Fig. 4.8). In addition, a best match HRTF condition was 
employed using an interactive exploration method, rather than the general ranking 
described in Sect. 4.2.2 and a worst-case selection scenario. Results indicated that 
the proposed training program led to improved learning rates compared to that of 
previous studies. A further addition of this study was the inclusion of a simulated 
room acoustic response, moving from the typical anechoic conditions of previous 
studies to a more natural acoustic for the user. Results showed that the addition of 
the room acoustics improved HRTF adaptation rate across sessions. 


4.5 Conclusions and Future Directions 


While binaural audio and spatial hearing have been studied for over 100 years, major 
advancements in these fields have occurred in the last two to three decades, possi- 
bly thanks to progress in real-time computing technologies. It has been extensively 
shown that everyone perceives spatial sound differently thanks to the particular shape 
of their ears, head and torso. For this reason, either high-quality simulations need to 


136 L. Picinali and B. F. G. Katz 


be uniquely tailored to each individual listener, or the listener needs to adapt to the 
configuration (i.e. the HRTF on offer) of the rendering system, or again some combi- 
nation of both using individualised HRTFs. This chapter has provided an overview of 
research aimed at systematically exploring, assessing and validating various aspects 
of these two approaches. But while there is a good level of agreement on certain 
notions and principles, e.g. that using non-individual HRTFs can result in impaired 
localisation performance which can however be improved through perceptual train- 
ing, there are still open challenges in need of further investigation. 

A rather general but very important question that has yet to be addressed is how 
we can measure whether a simulated immersive audio experience is suitable and of 
sufficient quality for a given individual. Previous work has established a certain level 
of standardisation for assessing general audio quality (e.g. related to telecommuni- 
cation and audio compression algorithms), but equivalent work has yet to be carried 
out in the field of immersive audio. Objective and subjective metrics for assessing 
HRTF similarity have been explored and evaluated in the past [5], and recently pub- 
lished research suggests that additional metrics might exist, e.g. looking at speech 
understanding performance [21] or machine learning artificial localization tests [3, 
13]. Nevertheless, extensive research is still needed in order to understand and model 
low-level psychophysical (sensory) as well as high-level psychological (cognitive) 
spatial hearing perception. 

Factors other than choices related to binaural audio processing could also have an 
impact on the overall perception of the rendered scenes. The fact that high-quality, 
albeit non-interactive, immersive audio rendering can be achieved through record- 
ings done with a simple binaural microphone, which by definition do not account for 
individualised HRTFs, can be considered an example of the major complexity and 
dimensionality of the problem. Matters such as the choice of audio content, the con- 
text of the rendered scene, as well as the experience of the listener (e.g. whether they 
have previously participated in immersive audio assessments) have been shown to be 
relevant when assessing the perceived quality of the immersive audio rendering [6, 
54]. This discussion continues naturally in Chap. 5.

Looking more in depth at the need to quantify the individually perceived quality 
of the rendering, the understanding of the perceptual weighting of morphological 
factors contributing to spatial hearing becomes an essential target to be achieved. 
Data-based machine learning approaches may be a useful tool when tackling this, as 
well as challenges related to user-to-system adaptation. Examples include allowing a 
certain level of customisation of the training by individually and adaptively varying 
the difficulty of the challenge, maximising learning and at the same time avoiding 
an overload of sensory and cognitive capabilities. Further explorations on spatial 
hearing adaptation should focus on the transferability of the acquired training
between different hearing skills (e.g. [100]) and examining to what extent spatial 
auditory training performed in VR is transferable to real-life tasks. 

Another very relevant yet still under-explored area of research is employing cog- 
nitive and psycho-physiological measurements when trying to assess both the quality 
of rendered spatial hearing cues and the cognitive effort during HRTF training. In 
the first case, measures related with behavioural performance, as well as electroen-
cephalographic markers of selective attention, could be used to assess the suitability
of immersive rendering choices [23], possibly opening the path towards passive 
perceptual-based HRTF selection. In the second case, similar metrics, with the addi- 
tion of other measures of listening effort such as pupil dilation [103], could be 
employed for customising spatial hearing training routines, maximising outcomes 
while maintaining engagement and feasibility of the proposed tasks. 


Final Thoughts 


While most studies have focused on laboratory conditions to isolate specific percep- 
tion elements, recent context-relevant studies have begun to examine the impact of 
spatial audio quality on task accomplishment. For example, [76] compared perfor- 
mance in a first-person-shooter VR game context with different HRTF conditions. 
Results showed performance for extreme elevation target positions was affected by 
the quality of HRTF matching. In addition, a subgroup of participants showed higher 
sensitivity to HRTF choice than others. At the same time, low-level sensory percep- 
tion is only one of the dimensions where immersive audio simulations can have a 
significant impact. In order to significantly advance our understanding of the impact 
of HRTF personalisation in virtually rendered scenes and tasks, research needs to 
move beyond the evaluation of individual immersive audio tasks and metrics (e.g. 
sound localisation and/or perceived quality of the rendering), moving towards the 
evaluation of full experiences. The impact of immersive audio beyond perceptual 
metrics such as localisation, externalisation and immersion [87] is an as yet unex-
plored area of research, specifically when related to social interaction, entering
the behavioural and cognitive realms. 

In the past, several studies have been published in which auditory-based AR/VR 
interactions were created and evaluated without considering HRTF choice or using 
HRTF personalisation approaches that had not previously been appropriately vali- 
dated from a perceptual point of view, or again ignoring the effects of HRTF accom- 
modation, or blaming them in order to justify unexpected results. Considering our 
current knowledge and experience in immersive audio research, we are keen to rec- 
ommend carrying out some level of personalisation of the spatial rendering when per- 
forming studies which involve auditory-based or multimodal interactions in AR/VR. 
As a baseline, ITDs can easily be customised to match the head circumference of 
the specific listener (as mentioned above, this function is already implemented in 
a few spatialisers, such as [22, 78]). Furthermore, HRTF selection routines, both 
perceptual and morphology based, could be very beneficial if carried out before the 
experiment, although it is important for the repeatability of such choices to be assessed
with the specific subject (i.e. repeating the selection several times in order to ver- 
ify the consistency across the trials, and possibly discard subjects/methods which 
do not show a sufficient level of repeatability). Regarding the use of synthesised 
HRTFs, until these are validated through extensive perceptual studies our advice 
is to use measured ones, possibly coming from the same dataset in order to avoid 
measurement-based differences. 
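
The baseline ITD customisation recommended above can be illustrated with a short sketch. It assumes the classical Woodworth spherical-head approximation, which is not necessarily the method implemented in the spatialisers cited in [22, 78]; the function names, the speed-of-sound constant and the conversion from head circumference to an equivalent sphere radius are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def head_radius_from_circumference(circumference_m):
    """Radius of a sphere whose great circle equals the measured head circumference."""
    return circumference_m / (2 * math.pi)


def woodworth_itd(azimuth_deg, circumference_m, c=SPEED_OF_SOUND):
    """Far-field ITD in seconds for a rigid spherical head (Woodworth approximation).

    Valid for source azimuths between 0 (front) and 90 degrees (fully lateral).
    """
    theta = math.radians(azimuth_deg)
    if not 0.0 <= theta <= math.pi / 2:
        raise ValueError("azimuth must lie between 0 and 90 degrees")
    radius = head_radius_from_circumference(circumference_m)
    return (radius / c) * (theta + math.sin(theta))


if __name__ == "__main__":
    # Compare two hypothetical listeners with different head circumferences.
    for circumference in (0.54, 0.60):
        itd_us = woodworth_itd(90.0, circumference) * 1e6
        print(f"circumference {circumference:.2f} m -> max ITD {itd_us:.0f} microseconds")
```

A larger head circumference yields a proportionally larger ITD at every azimuth, which is the effect that this form of personalisation accounts for.
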

In addition to these recommendations, it is important to emphasize that the future
of immersive audio research will need to include studies focusing on different con- 
texts (e.g. AR/VR interactions, virtual museum explorations and virtual assistant 
avatars), exploring the impact (and need) of HRTF personalisation on complex tasks 
such as interpersonal exchanges and distance learning in VR. Furthermore, in order 
to ensure a sufficient level of standardisation and consistently advance the achieve- 
ments of research in this area, it seems evident that a concerted and coordinated effort 
across disciplines and research groups is highly desirable. 


Chapter 5
Audio Quality Assessment for Virtual Reality


Fabian Brinkmann and Stefan Weinzierl 


Abstract A variety of methods for audio quality evaluation are available ranging 
from classic psychoacoustic methods like alternative forced-choice tests to more 
recent approaches such as quality taxonomies and plausibility. This chapter intro- 
duces methods that are deemed to be relevant for audio evaluation in virtual and 
augmented reality. It details to what extent these methods can be used directly for testing
in virtual reality or have to be adapted with respect to specific aspects. In addition, 
it highlights new areas, for example, quality of experience and presence that arise 
from audiovisual interactions and the mediation of virtual reality. After briefly intro- 
ducing 3D audio reproduction approaches for virtual reality, the quality that these 
approaches can achieve is discussed along with the aspects that influence the quality. 
The concluding section elaborates on current challenges and hot topics in the field of 
audio quality evaluation and audio reproduction for virtual reality. To bridge the gap 
between theory and practice useful resources, software and hardware for 3D audio 
production and research are pointed out. 


5.1 Introduction 


Over the past years, an increasing number of virtual and augmented reality (VR/AR) 
applications have emerged due to the advent of mobile devices such as smartphones and
head-mounted displays. Audio plays an important role within these applications that
is by no means restricted to conveying semantic information, for example, through
dialogues or warning sounds. Beyond that, audio holds information about the spa- 
ciousness of a scene including the location of sound sources and the reverberance 
or size of a virtual environment. In this way, audio can be regarded as a channel 
to provide semantic information and spatial information and improve the sense of
presence and immersion at the same time. Due to the key role of audio in VR/AR, 
this chapter gives an overview of methods for audio quality assessment in Sect. 5.2, 
followed by a brief introduction of audio reproduction techniques for VR/AR in 
Sect. 5.3. Readers who are familiar with audio reproduction techniques might skip 
Sect. 5.3 and directly continue with Sect. 5.4 that gives an overview of the quality of 
existing audio reproduction systems. 


5.2 Perceptual Qualities and Their Measurement 


Methods and systems for generating virtual and augmented environments can be 
understood as a special case of (interactive) audio reproduction systems. Thus, in 
principle, all procedures for the perceptual evaluation of audio systems can also 
be used for the evaluation of VR systems [6]. These include the procedures for 
the evaluation of “Basic Audio Quality”, which are standardized in various ITU 
recommendations and focus on the technical system properties and signal processing, 
as well as approaches with a wider focus on the listening situation and the presented 
audio content, taking into account the “Overall Listening Experience”. In addition, 
a number of measures have recently been proposed to more specifically determine 
the extent to which technologies for virtual and augmented environments live up to 
their claim of providing a convincing equivalent to physical acoustic reality. Finally, 
in addition to these holistic measures for evaluating VR and AR, there are a number 
of (VR-specific and VR-nonspecific) quality inventories that can be used to perform 
a differential diagnosis of VR systems, highlighting the individual strengths and 
weaknesses of the system and drawing conclusions for the targeted improvement. 


5.2.1 Generic Measures 


5.2.1.1 Basic Audio Quality 


Since the mid-1990s, the Radiocommunication Sector of the International Telecom- 
munication Union (ITU-R) has developed a series of recommendations for the “Sub- 
jective assessment of sound quality”. The series includes an overview of the areas 
of application of the recommendations with instructions for the selection of the 
appropriate standard [35] as well as an overview of “general methods” which are 
applied slightly differently in the different standards [36]. They contain instructions 
for experimental design, selection of the listening panel, test paradigms and scales, 
reproduction devices, and listening conditions up to the statistical treatment of col- 
lected data. Originally, these recommendations were mainly used for the perceptual 
evaluation of audio codecs, but later, they were also used for the evaluation of multi-
channel reproduction systems and 3D audio techniques. The central construct to be
evaluated by all ITU procedures is “Basic Audio Quality” (BAQ). It can be eval-
uated either by direct scaling or by rating the “impairment” relative to an explicit
or implicit reference and caused by deficits of the transmission system such as a
low-bitrate audio codec or by limitations of the spatial reproduction. By definition
BAQ includes “all aspects of the sound quality being assessed”, such as “timbre,
transparency, stereophonic imaging, spatial presentation, reverberance, echoes, har-
monic distortions, quantisation noise, pops, clicks and background noise” [36, p. 7].
In studies of impairment, listeners are asked “to judge any and all detected differ-
ences between the reference and the object” [34, p. 7]. In this case, the evaluation of
BAQ thus corresponds to a rating of general “similarity” or “difference”.


Fig. 5.1 User interfaces for ABC/HR and MUSHRA tests. Active conditions are indicated by
orange buttons; loop range and current playback position by orange boxes and lines. The ABC/HR
interface shows only one condition but versions with multiple conditions per rating screen are also
possible. If multiple conditions are displayed on a single screen, an additional button to sort the
conditions according to the current ratings might help subjects to establish more reliable ratings
(CC-BY, Fabian Brinkmann)


The most popular standards for BAQ are (cf. Fig. 5.1) 


• ITU-R BS.1116-3:2016 (Methods for the subjective assessment of small impair-
ments in audio systems) [34]. Listeners are asked to rate the difference between
an audio stimulus and a given reference stimulus using a continuous scale with
five labels (“Imperceptible”/“Perceptible, but not annoying”/“Slightly annoying”/
“Annoying”/“Very annoying”) used as “anchors”. Participants are presented with
three stimuli (A, B, C). A is the reference, and B and C are rated, with one of the
two stimuli again being the hidden reference (double-blind triple-stimulus with
hidden reference).

• ITU-R BS.1534 (Method for the subjective assessment of intermediate quality
level of audio systems) [37]. Unlike ITU-R BS.1116-3, it is a multi-stimulus test
where direct comparisons between the different stimuli are possible. Quality is
rated on a continuous scale with five labels (“Excellent”/“Good”/“Fair”/“Poor”/
“Bad”). Participants are presented with a reference, no more than nine stimuli
under test, and two anchor signals (MUlti-Stimulus test with Hidden Reference
and Anchor, MUSHRA). The standard anchors are low-pass filtered versions of
the original signal with cut-off frequencies of 3.5 kHz (low-quality anchor) and
7 kHz (mid-quality anchor). Alternatively or additionally, further non-standard
anchors can be used; they should resemble the character of the artifacts of the
systems being tested and indicate how the systems under test compare to well-known
audio quality levels. Possible anchors in the context of spatial audio might be
conventional mono/stereo recordings or non-individual signals. Since listeners
can directly compare the signals under test with the reference and among each
other, more reliable ratings can be expected in situations where stimuli differ
significantly from the reference, but only slightly from each other.
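
As an illustration of how the two standard MUSHRA anchors described above could be generated, the following minimal sketch low-pass filters a reference signal at 3.5 kHz and 7 kHz. The windowed-sinc FIR design, the filter length and all function names are assumptions made for this example; the recommendation itself should be consulted for the exact filter requirements.

```python
import math


def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Windowed-sinc (Hamming) low-pass FIR coefficients with unity gain at DC."""
    fc = cutoff_hz / fs_hz  # normalised cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def fir_filter(signal, taps):
    """Direct-form convolution; the output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * signal[i - k]
        out.append(acc)
    return out


def make_anchor(signal, fs_hz, cutoff_hz):
    """Return a low-pass filtered copy of the reference to be used as an anchor."""
    return fir_filter(signal, lowpass_fir(cutoff_hz, fs_hz))


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


if __name__ == "__main__":
    fs = 48000
    # Hypothetical reference: a mix of a 1 kHz and a 10 kHz tone.
    reference = [
        math.sin(2 * math.pi * 1000 * i / fs) + math.sin(2 * math.pi * 10000 * i / fs)
        for i in range(2000)
    ]
    low_anchor = make_anchor(reference, fs, 3500.0)  # low-quality anchor
    mid_anchor = make_anchor(reference, fs, 7000.0)  # mid-quality anchor
    print(rms(reference[200:]), rms(low_anchor[200:]), rms(mid_anchor[200:]))
```

In practice, a dedicated signal processing library would normally replace the hand-written filter; the pure-Python version is used here only to keep the sketch self-contained.
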


Although BAQ is the standard attribute to be tested in both ITU recommendations, 
other attributes are suggested to test more specific aspects of audio systems such as 
spatial and timbral qualities. ITU-R BS.1284-2 contains a list of main attributes and 
sub-attributes, from which one can choose those suitable for a particular test [36, 
Attachment 1]. In this respect, both recommendations are often used only as an 
experimental paradigm, but applied to qualities other than BAQ, e.g., those developed 
in various taxonomies on the properties of VR systems (see Sect. 5.2.2.4). 

A number of issues were raised addressing specific aspects of the ITU recom- 
mendations [55]. One pertains to the scale labels being multidimensional, which 
could distort the ratings. This can be avoided by using clearly unidimensional labels 
at both ends, e.g., “imperceptible”/“very perceptible” for ABC/HR or “good”/“bad”
for MUSHRA and additional unlabeled lines for orientation. Another issue points out 
that data from MUSHRA tests often violate assumptions for conducting an Analysis 
of Variance (ANOVA), the most common means for statistical analysis of the results. 
This can be addressed by using general linear models for the analysis, which are more
flexible than ANOVA and pose fewer requirements on the input data [33].


5.2.1.2 Overall Listening Experience 


The construct of “Overall Listening Experience” (OLE) [70] was derived from the 
concept of “Quality of Experience”, which in the context of quality management 
describes “the degree of delight or annoyance of the user of an application or ser- 
vice” [11], considering not only the technical performance of a system but also the 
expectations and personality and current state of the user as influencing factors. In 
contrast to listening tests according to the ITU recommendations, the musical content 
is thus explicitly part of the judgment that listeners make about the OLE. 

A measurement of the OLE can be a useful alternative or supplement to purely 
system-related evaluations insofar as, for example, the difference between different 
playback systems for music may very well be audible in a direct comparison, but 
hardly relevant for everyday music consumption, also in comparison to the liking 
of the music played. In this respect, an evaluation according to ITU may possibly 
convey a false picture of the general relevance of technical functions. This becomes
evident, for example, in a direct comparison between BAQ and OLE ratings of spatial
audio systems, where the differences between BAQ ratings are generally larger than
between OLE ratings. In a listening test, both BAQ ratings according to ITU-R
BS.1534 with explicit reference and OLE ratings (“Please rate for each audio excerpt
how much you enjoyed listening to it”) without explicit reference were collected for
three different spatial audio systems (2.0 stereo, 5.0 surround, 22.2 surround [71]).
While the difference between 2.0 and 5.0 was equally visible in BAQ and OLE, the
difference between 5.0 and 22.2 was clearly audible in a direct comparison (BAQ),
but evidently did not result in a significant increase in listening pleasure (OLE,
Fig. 5.2).


Fig. 5.2 Results of a listening test (z-standardized scores) of Basic Audio Quality (BAQ) and
Overall Listening Experience (OLE) for three different spatial audio systems (2.0 stereo, 5.0
surround, 22.2 sound referred to as “3D Audio”). BAQ ratings were given according to ITU-R
BS.1534 relative to the “3D audio” condition as an explicit reference, whereas OLE ratings were
given without a reference stimulus [71, p. 84]


5.2.2 VR/AR-Specific Measures


5.2.2.1 Authenticity 


A simulation that is indistinguishable from the physical sound field it is intended to 
simulate could be termed authentic. The term could be used in a physical sense; then 
it would aim at the identity of sound fields, be it the identity of sound pressures in the 
ear canal (binaural technology) or the identity of sound fields in an extended spatial 
area (sound field synthesis). Since no technical system is currently able to guarantee 
such an identity, and since such a physical identity may also not be required for the 
users of VR/AR systems, the term authenticity is mostly used in the psychological 
sense. In this sense, it denotes a simulation that is perceptually indistinguishable 
from the corresponding real sound field [8]. 

The challenge in determining perceptual authenticity is not to let the presence
of a simulation or the physical reference in the listening test become recognizable 
solely through the environment of the presentation, i.e., by wearing headphones 
as opposed to listening freely in the physical sound field, or by listening in a studio 
environment that does not correspond to the simulated space even purely visually. For 
this reason, a determination of the authenticity of loudspeaker-based systems such 
as Wave Field Synthesis (WFS) or Higher-Order Ambisonics (HOA) can hardly be 
carried out in practice, because even if one were to suppress the visual impression 
by means of a blindfold, the listener would have to be brought from the playback 
room of the synthesis into the real reference room, which would no longer allow a 
direct comparison due to the temporal delay. Setting up a sound field synthesis in the 
corresponding physical room, on the other hand, would be prohibited, since the room 
acoustics of the physical room would influence the sound field of the loudspeaker 
synthesis. 

A determination of authenticity is simpler for binaural technology systems. By 
using open headphones that are largely transparent to the external sound field and 
whose influence can possibly be compensated by an equalization filter, a direct com- 
parison can be made by switching back and forth between a physical sound source 
and its binaural simulation [8]. The influence of the headphones on the external 
sound field can be further minimized by using extra-aural headphones suspended a 
few centimeters in front of the ear [18]. Such an influence can also come from other 
VR devices such as head-mounted displays that are close to the ear canal [27]. An 
example of a listening test setup is shown in Fig. 5.3. 

Fig. 5.3 Listening test setup for testing authenticity and plausibility. For seamless switching
between audio from the loudspeakers and their binaural simulation, the subject is wearing
extra-aural headphones that minimize distortions of exterior sound fields. The head position of
the subject is tracked by an electromagnetic sensor pair mounted on the top of the chair and
headphones. See also Sect. 5.4.1.1 (CC-BY, Fabian Brinkmann)


As a paradigm for the listening test, classical procedures such as ABX with explicit
reference [12, 44] or forced-choice procedures (N-AFC) with non-explicit reference
[21] can be used, which have proven suitable for detecting small differences between
two stimuli. It should be noted that, especially in the case of minor differences, the
presentation mode can have a great influence on the recognition rate, for example,
whether the two stimuli (simulation and reference) can be heard by the test
participants only once or as often as desired [8, p. 1793 f]. An example of a user
interface is given in Fig. 5.4.
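
Results from ABX or N-AFC tests of this kind are commonly compared against the guessing rate. The following sketch shows one possible analysis, an exact one-sided binomial test; it is an illustrative assumption rather than the procedure used in the cited studies, and the function names and the example numbers are hypothetical.

```python
from math import comb


def binomial_p_value(correct, trials, chance=0.5):
    """One-sided probability of at least `correct` correct answers under pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def min_correct_for_significance(trials, chance=0.5, alpha=0.05):
    """Smallest number of correct answers whose p-value does not exceed alpha."""
    for k in range(trials + 1):
        if binomial_p_value(k, trials, chance) <= alpha:
            return k
    return None  # not reachable with this number of trials


if __name__ == "__main__":
    # ABX and 2AFC have a guessing rate of 1/2; a 3AFC test has 1/3.
    print(binomial_p_value(12, 16))              # ABX, 12 of 16 correct
    print(min_correct_for_significance(16))      # ABX threshold for 16 trials
    print(min_correct_for_significance(24, chance=1 / 3))
```

Note that a non-significant result does not by itself demonstrate authenticity; it only means that a difference could not be detected with the given number of trials.
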

Binaural representations can also be used to make comparisons of physical sound 
fields and simulations based on loudspeaker arrays [85]. For this purpose, the mea- 
sured or numerically simulated sound field of a loudspeaker array at a given listening 
position can be presented in the listening test as a binaural synthesis, thus avoiding 
the problems described above when comparing physical and loudspeaker-simulated 
sound fields. It should be noted, however, in this case, the simulation (binaural syn- 
thesis) of a simulation (sound field synthesis) becomes audible, so it may be difficult 
to separate the artifacts of the two methods. 


5.2.2.2 Plausibility 


While the authenticity of virtual environments can be determined by the (physical 
or perceptual) identity of physical and simulated sound fields, plausibility has been 
proposed as a measure of the extent to which a simulation is “in agreement with 
the listener’s expectation towards a corresponding real event” [47]. Plausibility thus 
does not address the comparison with an external, presented reference, but the con- 
sideration against the background of an inner reference that reflects the credibility 
of the simulation, based on the listener’s experience and expectations of the internal 
structure of acoustic scenes or environments. The operationalization of this construct 
thus does not require a comparative evaluation, but a yes–no decision.

By analyzing such yes–no decisions with the statistical framework of signal detec-
tion theory (SDT, [84]), one can separate the response bias, i.e., a general, subjective
tendency to consider stimuli as “real” or “simulated”, from the actual impairments of
the simulation. Signal detection theory is originally a method for determining thresh-
old values. For example, the absolute hearing threshold of sounds can be determined
by the statistical analysis of a 2×2 contingency table in which two correct answers
(sound present and heard, sound absent and not heard, i.e., hits and correct rejections) 
and two incorrect answers (sound present and not heard, sound absent and heard, 
i.e., misses and false alarms) occur. By contrasting these response frequencies, the 
response bias, i.e., a general tendency to mark sounds as “heard,” can be separated 
from actual recognition performance. The latter is represented by the sensitivity d′,
which can be converted to a corresponding 2AFC detection rate. At least 100
yes–no decisions per subject are considered necessary for obtaining stable
individual SDT parameters [40].
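
The analysis of the 2×2 contingency table can be sketched as follows, assuming the common equal-variance Gaussian SDT model. The function name and the example counts are hypothetical, and hit or false-alarm rates of exactly 0 or 1 would need a correction that is not shown here.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion, p_2afc) under the equal-variance Gaussian model.

    In a plausibility test, a "hit" would be a simulated stimulus judged as
    "simulated" and a "false alarm" a real stimulus judged as "simulated".
    """
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    z_hit = _STD_NORMAL.inv_cdf(hit_rate)  # raises an error for rates of 0 or 1
    z_fa = _STD_NORMAL.inv_cdf(fa_rate)
    d_prime = z_hit - z_fa                       # sensitivity
    criterion = -0.5 * (z_hit + z_fa)            # response bias
    p_2afc = _STD_NORMAL.cdf(d_prime / sqrt(2))  # equivalent 2AFC proportion correct
    return d_prime, criterion, p_2afc


if __name__ == "__main__":
    # Hypothetical subject: 100 yes-no decisions, 50 simulated and 50 real stimuli.
    d_prime, criterion, p_2afc = sdt_measures(30, 20, 20, 30)
    print(f"d' = {d_prime:.3f}, c = {criterion:.3f}, 2AFC = {p_2afc:.3f}")
```

A sensitivity of zero corresponds to a 2AFC detection rate of 50%, i.e., pure guessing, irrespective of how strongly a subject tends towards one of the two answers.
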

This approach can be applied to the evaluation of virtual realities, in that the 
artifacts caused by deficits in the simulation take on the role of a stimulus to be 
discovered, and listeners are asked to identify the environment as “simulated” if they 
notice them. The prerequisite for such an experiment is, however, that, similar to an
experiment on “authenticity”, one can present both physically “real” and simulated
sound fields without the nature of the stimulus already being recognizable on the basis 
of the experimental environment, for example, by providing a visual representation of 
the physical sound source also in the simulated case, or by conducting the experiment 
with closed or blindfolded eyes. 


5.2.2.3 Sense of Presence and Immersion 


A central function of VR systems is to create a “sense of presence”, i.e., the feeling of 
being or acting in a place, even when one is physically situated in another location and 
the sensory input is known to be technically mediated. The concept of presence, also 
called “telepresence” in older literature in reference to teleoperation systems used to 
manipulate remote physical objects [58], has given rise to its own research direction 
and community in the form of presence research, which is organized in societies such 
as the International Society for Presence Research (ISPR) and conferences such as 
the biennial PRESENCE conference.

To measure the degree of presence, different questionnaires have been developed. 
For an overview see [72]. The instrument of Witmer and Singer [87], one of the most
widely used questionnaires, contains 28 questions such as “How much were you able 
to control events?”, “How responsive was the environment to actions that you initiated 
(or performed)?”, “How natural did your interactions with the environment seem?”, 
or “How completely were all of your senses engaged?”. Analyzing the response pat- 
terns in these questionnaires, different dimensions such as “Involvement”, “Sensory 
fidelity”, “Adaptation/immersion”, and “Interface quality” have emerged in factor
analytic studies [86]. 

Other approaches to measuring presence include behavioral measurements. If one 
assumes that presence is given if the reactions to a virtual environment correspond 
to the behavior in physical environments, then for example, the swaying caused by 
a moving visual scene or ducking in response to a flying object can be used as an
indicator for the degree of presence [19]. As a prerequisite for such realistic behavior, 
Slater considers two aspects: The sensation of being in a real place (“place illusion”) 
and the illusion that the scenario being depicted is actually occurring (“plausibility 
illusion”) [75]. Note, however, that “plausibility” is used here, in comparison with 
the understanding used in Sect. 5.2.2.2, in a narrower sense with a slightly different 
meaning. 

A similar idea is behind the use of psychophysiological measures. If the normal 
physiological response of a person to a particular situation is replicated in a VR 
environment, this can be considered as an indicator of presence. Physiological
parameters have been used to assess various functions and applications of VR
systems [28], and they have also been used to measure presence in several studies.
Depending on the scenario presented, the Electroencephalogram (EEG) [5], heart 
rate (HR) [14], or skin conductance and heart rate variability [13] were shown to be 
indicators of different degrees of presence. The exact correlations, however, seem 
to depend very much on the scenario presented in each case, and in any case, com- 
parative values from a corresponding real-life stimulus are required to calibrate the 
measurement. Breaks in presence (BIPs), i.e., moments where users become
aware of the mediatedness of the VR experience because shortcomings of the system
suddenly become obvious, also seem to be associated with physiological responses [76].

In general, these approaches seem to be limited to situations in which physiological 
reactions are sufficiently pronounced, such as anger, fear, or stress [54], whereas 
reactions are less pronounced when the person is predominantly an observer of a 
scene that has little emotional impact. This may be the reason why manipulations 
to the level of presence in these studies were almost exclusively realized through 
changes to the visual display and user interaction, while physiological parameters 
were hardly used to evaluate the degree of presence in acoustic virtual environments. 

The sense of presence, long used as a measure for evaluating VR and AR sys- 
tems alone, has recently gained increasing attention as a general neuropsychological 
phenomenon evolving from biological as well as cultural factors [68]. From the 
perspective of evolutionary psychology, the sense of presence has evolved not to 
distinguish between real and virtual conditions, but to distinguish the external world 
from phenomena attributable to one’s own body and mind. On such a theoretical 
basis, it seems logical that, for achieving a high level of presence, not only the sensory
plausibility and the naturalness of the interaction but also the meaning and relevance
of the scene for the respective user are essential. The degree of presence in a virtual
scene will remain limited if the content is irrelevant to the respective user [66]. 

Related to the sense of presence, but less consistently used, is the concept of 
“immersion”. In some literature, it is treated as an objective property of VR and 
AR systems [77]. According to this technical understanding, a 5-channel system is 
considered more “immersive” than a two-channel system, simply because it is able 
to present a wider range of sound incidence directions to the listener. In other works, 
however, immersion is treated as a psychological construct, i.e., a human response to
a technical system [87], shifting the meaning of “immersion” closer to the concept 
of presence [74]. Finally, in many works, especially in the field of audio, it remains 
unclear whether the reasoning about immersion is on a technical or psychological
level. Chapter 11 discusses this issue in more depth, focusing on
audiovisual experiences.


5.2.2.4 Attributes and Taxonomies 


With properties such as authenticity, plausibility, or the sense of presence, a global 
assessment of VR systems is intended. In order to obtain indications of the strengths 
and weaknesses of these systems and to draw appropriate conclusions for improve- 
ment, however, a differential diagnosis is required that separately assesses different 
qualities of the respective systems. To distinguish these perceptual qualities from 
technical parameters of the system that may have an influence on them, the former
are also referred to as “Quality features” and the latter as “Quality elements” in the
context of product-sound quality [38].

For this purpose, different taxonomies for the qualities of virtual acoustic envi- 
ronments, 3D audio or spatial audio systems have been developed. Some of these 
are based on earlier collections of attributes for sound quality and spatial audio qual- 
ity [42] which were clustered in sound families using semantic analyses such as 
free categorization or multidimensional scaling (MDS) [43]. Pedersen and Zacharov 
(2015) [62] developed a sound wheel to present such a lexicon for reproduced sound.² 
The wheel format has a longer tradition in the domain of food quality and sensory 
evaluation [60] as a structured and hierarchical form of a lexicon of different sensory 
characteristics. The selection of the items and the structure of the wheel in [62] are 
based on empirical methods such as hierarchical cluster analysis and measures for 
discrimination, reliability, and inter-rater agreement of the individual items. 

While the taxonomies mentioned above were developed for spatial audio sys- 
tems and product categories such as headphones, loudspeakers, multi-channel sound 
in general, others were generated with a stronger focus on virtual acoustic envi- 
ronments. Developed by qualitative methods such as expert surveys (DELPHI 
method [73]) and expert focus groups [48], they contain between 7 [73] and 48 
attributes [48], from which those relevant to the specific experiment can be selected. 
Examples of a VR/AR specific taxonomy and a rating interface are shown in 
Figs. 5.5 and 5.6. 


5.2.3 VR/AR-Specific User Interfaces, Test Procedures, and Toolkits 


While the quality measures introduced so far can theoretically be directly transferred 
for testing in VR and AR, there are specific features that should be addressed: the 
test method and interface, the technical administration of the test, and the effect of 
added degrees of freedom on the subjects. 

2 Currently maintained under https://forcetechnology.com/en/articles/gated-content-senselab-sound-wheel 
(last access 2022/06/17). 

Fig. 5.5 SAQI wheel for the evaluation of virtual acoustic environments, structured into informal 
categories (inner ring) and attributes (outer ring). For definitions and sound examples refer to 

Fig. 5.6 User interface for conducting a SAQI test. The interface is similar to that of 
a MUSHRA test shown in Fig. 5.1 with the difference that the current quality to be 
rated is given together with the possibility to show its definition (info button) and 
that the rating scale can also be bipolar. In any case, zero ratings indicate no 
perceivable difference (CC-BY, Fabian Brinkmann) 

First, most of the test methods and user interfaces were developed to be accessed 
on a computer with a mouse as a pointing and clicking device. The rating procedure 
and the elements on the user interface might thus not be optimal for testing in VR/AR. 
This might be less relevant for simple paradigms such as ABX or yes/no tests but 
can certainly become an issue for rating the quality of multiple test conditions. 

Two approaches were suggested to account for this. Volker et al. [81] suggested a 
modified MUSHRA to simplify the rating interface and make it easier to establish an 
order between test conditions, especially if many test conditions are to be compared 
against the reference and each other (cf. Fig. 5.7). The idea is to unify playback 
and rating by making use of drag and drop actions, where the playback is triggered 
when the subject drags a button corresponding to a test condition, and the rating is 
achieved by dropping the button on a two-dimensional scale. Ratings obtained with 
the modified interface were comparable to those obtained with the classic interface 
in terms of test-retest reliability and discrimination ability. At the same time, the 
modified interface was preferred by the subjects, and subjects needed less time to 
complete the rating task. Note that the Drag and Drop MUSHRA could be easily 
adapted for testing quality taxonomies introduced in Sect. 5.2.2.4. 

A VR/AR-tailored approach to further simplify the rating procedure and interface 
was suggested by Rummukainen et al. [67]. They designed a simple and easy-to- 
operate interface, where the subject eliminates the conditions one after another in 
the order from worst to best (cf. Fig. 5.8). The elimination constitutes a rank order 
between the stimuli from which interval-scaled values (similar to Basic Audio Quality 
ratings) were obtained by fitting Plackett-Luce models to the ranking vectors. 
As with the Drag and Drop MUSHRA, the elimination task could be adapted for 
testing against a reference and using taxonomies. 
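
The step from elimination orders to interval-scaled values can be illustrated with a small sketch. It is not the analysis code of [67]; it assumes a standard minorization-maximization iteration for the Plackett-Luce model, and the function name `fit_plackett_luce` as well as the six elimination orders are made up for illustration.

```python
import numpy as np

def fit_plackett_luce(rankings, n_items, n_iter=200):
    """Estimate Plackett-Luce worth parameters from full rankings (best first).

    Illustrative MM iteration; assumes every item is ranked last at most
    in some, but not in all, of the rankings.
    """
    w = np.ones(n_items) / n_items
    # An item "wins" a stage whenever it is picked before the last place.
    wins = np.zeros(n_items)
    for r in rankings:
        for item in r[:-1]:
            wins[item] += 1
    for _ in range(n_iter):
        denom = np.zeros(n_items)
        for r in rankings:
            r = np.asarray(r)
            for t in range(len(r) - 1):
                remaining = r[t:]          # items still available at stage t
                denom[remaining] += 1.0 / w[remaining].sum()
        w = wins / denom
        w /= w.sum()
    return w

# Hypothetical data: six subjects eliminate four conditions from worst to best.
eliminations = [
    [3, 2, 1, 0], [3, 1, 2, 0], [2, 3, 1, 0],
    [3, 2, 0, 1], [3, 2, 1, 0], [2, 3, 0, 1],
]
rankings = [order[::-1] for order in eliminations]  # best first
worth = fit_plackett_luce(rankings, n_items=4)
scale = np.log(worth)  # log-worths can serve as an interval-like scale
```

Reversing each elimination order gives a ranking from best to worst, which is the form the model expects. The log-worths are only defined up to an additive constant, so differences between conditions are the meaningful quantity.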

Classic tests of Basic Audio Quality are most often conducted for (static) audio- 
only conditions and a variety of software solutions is available to conduct such 
tests [6, Sect. 9.2.3]. In contrast, tests in VR/AR require the experimental control of 
complex audiovisual scenes. In addition, the display of rating interfaces might affect 
the Quality of Experience (QoE) of interactive environments due to their potentially 
negative effect on the perceived presence [65]. An emerging tool to account for these 
aspects of AR/VR is the Quality of Experience Evaluation Tool (Q.ExE) currently 
developed by Raake et al. [65]. 

A third VR/AR-specific aspect is the possibility of freely exploring an audiovisual 
scene in six degrees of freedom (6DoF). Introducing 6DoF clearly affects the rating 
behavior of subjects [67] and might thus be considered problematic at first glance. An 
unrestricted 6DoF exploration is, however, the most realistic test condition. While 
this might introduce additional variance in the results, it might also be argued that 
results are more comprehensive and reflect more aspects of the audiovisual scene 
due to free exploration. Whether or not the exploration should be restricted will thus 
ultimately depend on the aim of an investigation. 


5.3 Audio Reproduction Techniques 


Two fundamentally different paradigms can be distinguished in audio reproduction 
for VR/AR that can be illustrated with the help of Fig. 5.9. The picture shows a 
simple sound field of a point source being reflected by an infinite wall. 

The first paradigm is to reproduce the entire sound field in a controlled zone, 
which has two advantages. First, multiple listeners can freely explore the sound field 
at the same time, and second, the reproduction is already individual as every lis- 
tener naturally perceives the sound through their own ears. However, there are three 
disadvantages. First, reproducing the entire sound field requires tens or hundreds of 
loudspeakers depending on the reproduction algorithm and the size of the listening 
area. Second, it requires an acoustically treated environment to avoid detrimental 
effects due to reflections from the reproduction room itself. Third, it is often chal- 
lenging to achieve a correct reproduction covering the entire hearing range from 
approximately 20 Hz to 20 kHz. In the following, this reproduction paradigm will 
be referred to as sound field synthesis (SFS). 

The second paradigm is to only reproduce the sound field at the listeners’ ears. 
The three advantages of this approach are that it can be realized with a single pair 
of headphones or loudspeakers, that at least headphone-based reproduction does not 
pose any demands on the reproduction room, and that a broad frequency range can 
be correctly reproduced. In turn, two disadvantages arise. First, the position and 
head orientation of the listeners must be tracked to enable a free exploration of the 
sound field. Second, the individualization of the ear signals is challenging. Often, 
the reproduced signals stem from a dummy head, which can cause artifacts such as 
coloration and increased localization errors in case the ears, head, and torso of the 
listener differ from the dummy head. This reproduction paradigm will be referred to 
as binaural synthesis in the following. 

Fig. 5.9 Sound field of a [...] position. (CC-BY, Fabian Brinkmann) 

It is interesting to see that the advantages and disadvantages of the two paradigms 
are exactly opposite, thus generating a strong bond between the application and the 
reproduction paradigm. Whereas binaural synthesis is the apparent option for any 
application on mobile devices, sound field synthesis is appealing for public or open spaces 
such as artistic performances and public address systems. The next sections will 
introduce the two paradigms in more detail. We focus on technical aspects but start 
with brief theoretical introductions to foster a better understanding of the subject as 
a whole. 


5.3.1 Sound Field Analysis and Synthesis 


The idea behind sound field analysis and synthesis (SFA/SFS) is to reproduce a 
desired sound field within a defined listening area using a loudspeaker array. The 
example in Fig. 5.10 shows this for the simple case of a plane wave traveling in the 
normal direction of a linear array. 

Two fundamentally different SFA/SFS approaches can be distinguished. Physically 
motivated algorithms aim at capturing and reproducing sound fields in a physically 
correct manner, while perceptually motivated methods aim at capturing and synthesizing 
sound field properties that are deemed to be of high perceptual relevance. 


5.3.1.1 Sound Field Acquisition and Analysis 


Sound field synthesis requires a sound field that should be reproduced and there 
are two options for its acquisition: through measurement or simulation. Measured 
sound fields can have a high degree of realism and can, for example, be used for 
broadcasting concerts, while simulated sound fields offer more flexibility in the 
design of the auditory scene and are thus often used in game audio engines (please 
refer to Chap. 3 for an introduction to interactive auralization). The description and 
evaluation of sound field simulation techniques is beyond the scope of the article and 
the interested reader is kindly referred to related review articles [10, 79]. 

Sound fields are usually measured through microphone arrays, i.e., spatially dis- 
tributed microphones that are in most cases positioned on the surface of a rigid or 
imaginary sphere. They can be used to directly record sound scenes such as concerts. 
In some cases, however, a direct recording will be limiting as it does not allow 
changing the audio content once the recording is finished. Changing the content becomes 
possible if so-called spatial room impulse responses (SRIRs) are measured, i.e., impulse responses 
that describe the sound propagation between sound sources and each microphone of 
the array. 

A common method for physically motivated SFA is the plane wave decomposition 
(PWD), which applies Fourier Transforms with respect to time and space to the 
acquired sound field [64, Chap. 2]. It derives a spatially continuous description of the 
analyzed sound field containing information on the times and directions of arriving 
plane waves. If the analyzing array has sufficiently many microphones, PWD can 
yield a physically correct and complete description of the sound field. 

Popular approaches for perceptually motivated SFA are spatial impulse response 
rendering (SIRR), directional audio coding (DirAC), and the spatial decomposition 
method (SDM) [64, 78, Chaps. 4–6]. These approaches use a time-frequency analysis 
to extract the direction of arrival and, in the case of SIRR and DirAC, also the residual 
diffuseness for each time-frequency slot. The intention is to extract this 
information from signals recorded with only a few microphones (typically between 
4 and 16) and to reproduce the signals with an increased resolution using methods 
introduced in the following sections. SIRR and SDM only work with SRIRs, while 
PWD and DirAC also work with direct recordings. While SDM uses a broadband 
frequency analysis and extremely short time windows, the remaining methods use 
perceptually motivated time and frequency resolutions. SDM is able to extract a single 
prominent reflection per time window while the PWD and higher order realizations 
of SIRR and DirAC can detect multiple reflections in each time-frequency slot. 
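
To make the idea of extracting a direction of arrival and a diffuseness value more tangible, the following sketch analyzes a single broadband frame of a first-order Ambisonics signal via the time-averaged intensity vector. It is a strongly simplified illustration in the spirit of DirAC, not the published algorithm: there is no time-frequency decomposition, and the channel normalization (W equal to the pressure signal, no scaling factors) and the function names are assumptions.

```python
import numpy as np

def encode_first_order(signal, azimuth, elevation):
    """Encode a plane wave as first-order Ambisonics (W, X, Y, Z), angles in rad.

    Assumed convention: unscaled W channel, x pointing to the front, y to the left.
    """
    return np.stack([
        signal,
        signal * np.cos(azimuth) * np.cos(elevation),
        signal * np.sin(azimuth) * np.cos(elevation),
        signal * np.sin(elevation),
    ])

def estimate_doa_and_diffuseness(bformat):
    """Estimate direction of arrival and diffuseness for one analysis frame."""
    w, xyz = bformat[0], bformat[1:]
    intensity = np.mean(w * xyz, axis=1)                  # points towards the source
    energy = 0.5 * np.mean(w**2 + np.sum(xyz**2, axis=0))
    norm = np.linalg.norm(intensity)
    diffuseness = 1.0 - norm / energy                     # 0: single wave, 1: diffuse
    direction = intensity / norm
    azimuth = np.arctan2(direction[1], direction[0])
    elevation = np.arcsin(np.clip(direction[2], -1.0, 1.0))
    return azimuth, elevation, diffuseness

# Example: noise arriving from 60 degrees azimuth and 20 degrees elevation.
rng = np.random.default_rng(0)
frame = encode_first_order(rng.standard_normal(1024), np.radians(60), np.radians(20))
az, el, psi = estimate_doa_and_diffuseness(frame)
```

For a single plane wave the estimated diffuseness is zero and the direction is recovered exactly; with reverberation the intensity vector shrinks relative to the energy and the diffuseness approaches one.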
5.3.1.2 Physically Motivated Sound Field Reproduction 


The two methods for physically motivated sound field reproduction are wave field 
synthesis (WFS, works with linear, planar, rectangular, and cubic loudspeaker arrays) 
and near-field compensated higher order Ambisonics (NFC-HOA, works with cir- 
cular and spherical arrays) [1]. Both methods can reproduce plane waves and point 
sources by filtering and delaying the sounds for each loudspeaker in the array. In the 
simple case shown in Fig. 5.10, all loudspeakers play identical signals. Because of 
their high computational demand, WFS and NFC-HOA are rarely used with mea- 
sured sound fields that consist of hundreds of sources/waves. One possible approach 
is to use only a few point sources for the direct sound and early reflections, and a 
small number of plane waves for the reverberation. 
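
The delay part of this processing can be sketched for a plane wave reproduced by a linear array. The example deliberately omits the filtering and amplitude weighting that WFS additionally requires and only computes per-loudspeaker delays from the array geometry. The function name, the loudspeaker spacing, and the speed of sound of 343 m/s are assumptions for illustration.

```python
import numpy as np

def plane_wave_delays(x_positions, angle_deg, c=343.0):
    """Delays in seconds for loudspeakers on the x-axis to emit a plane wave.

    angle_deg is the propagation direction relative to the array normal;
    0 degrees means the wave travels perpendicular to the array.
    """
    x = np.asarray(x_positions, dtype=float)
    delays = x * np.sin(np.radians(angle_deg)) / c
    return delays - delays.min()   # shift so that all delays are causal

speakers = np.arange(8) * 0.1                  # hypothetical array, 10 cm spacing
normal = plane_wave_delays(speakers, 0.0)      # all zeros: identical signals
oblique = plane_wave_delays(speakers, 30.0)    # linearly increasing delays
```

For normal incidence all delays vanish, which corresponds to the case of identical loudspeaker signals mentioned above.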


5.3.1.3 Perceptually Motivated Sound Field Reproduction 


The most common methods for perceptually motivated sound field reproduction 
are vector-based amplitude panning (VBAP), multiple direction amplitude panning 
(MDAP), and Ambisonics panning, which aim at reproducing point-like sources [89, 
Chaps. 1, 3, and 4]. VBAP is an extension of stereo panning to arbitrary loudspeaker 
array geometries. It uses one to three speakers that are closest to the position of 
the virtual source to create a phantom source. MDAP creates a discrete ring of 
phantom sources, each realized using VBAP, around the position of the virtual 
source so that the perceived source width becomes almost independent of 
the position of the virtual source. Ambisonics panning could be thought of as a 
beamformer that uses all loudspeakers of the array simultaneously to excite circular 
or spherical sound field modes. In this case, the position of the virtual source is given 
by the position of the beam. Similar to MDAP, Ambisonics yields virtual sources with 
an almost position-independent perceived width. In all cases, the degree to which the 
width of the sources can be controlled increases with the number of loudspeakers. 
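
A minimal sketch of the VBAP gain computation for one loudspeaker triplet is given below. It assumes the common formulation in which the source direction is written as a non-negative combination of the three loudspeaker unit vectors and the gains are normalized to unit energy. The function names and the loudspeaker layout are illustrative, and the selection of the active triplet from a larger array is not shown.

```python
import numpy as np

def unit_vector(az_deg, el_deg):
    """Cartesian unit vector for azimuth and elevation given in degrees."""
    az, el = np.radians([az_deg, el_deg])
    return np.array([np.cos(az) * np.cos(el), np.sin(az) * np.cos(el), np.sin(el)])

def vbap_gains(source_dir, speaker_dirs):
    """Gains for a loudspeaker triplet; speaker_dirs holds one unit vector per row."""
    L = np.asarray(speaker_dirs, dtype=float)
    g = np.linalg.solve(L.T, source_dir)       # solve source = g1*l1 + g2*l2 + g3*l3
    if np.any(g < -1e-9):
        raise ValueError("source direction lies outside the loudspeaker triplet")
    g = np.clip(g, 0.0, None)
    return g / np.linalg.norm(g)               # constant-energy normalization

triplet = np.array([unit_vector(30, 0), unit_vector(-30, 0), unit_vector(0, 45)])
gains_center = vbap_gains(unit_vector(0, 0), triplet)    # between the front pair
gains_speaker = vbap_gains(unit_vector(30, 0), triplet)  # exactly at speaker 1
```

A source in the middle of the two frontal loudspeakers receives equal gains on both and none on the elevated one, which reduces to conventional stereo panning. A source at a loudspeaker position is played by that loudspeaker alone.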

In many applications, these methods are used as a means to reproduce sound 
fields that were analyzed using SIRR, SDM, and DirAC. Two reasons for this are 
their computational efficiency and the fact that they are relatively robust against irreg- 
ular loudspeaker arrays (non-spherical, missing speakers), which are advantages over 
physically motivated approaches. VBAP and MDAP are robust to irregular arrays by 
design (they do not pose any demands on the array geometry). This is not generally 
true for Ambisonics panning; however, the state-of-the-art All-Round Ambisonic 
Decoder (AllRAD, [89, Sect. 4.9.6]), which combines VBAP and Ambisonics panning, 
handles irregular arrays well. 
5.3.2 Binaural Synthesis 


The fundamental theorem of binaural technology is that recording and reproducing 
the sound pressure signals at a listener’s ears will evoke the same auditory perception 
as if the listener were exposed to the actual sound field. This is because all acoustic 
cues that the human auditory system exploits for spatial hearing are contained in the 
ear signals. These cues are interaural time and level differences (ITD, ILD), spectral 
cues (SC), and environmental cues. ITD and ILD stem from the spatial separation 
of the ears and the acoustic shadow of the head and make it possible to perceive the 
position of a source in the lateral dimension (left/right). Spectral cues originate from 
direction-dependent filtering of the outer ear and enable us to perceive the source 
position in the polar dimension (up/down). The most prominent environmental cue 
might be reverberation from which information about the source distance and the 
size of a room can be extracted. For more information please refer to Blauert [7] 
and to Chap. 4 of this volume. 
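
The dependence of the ITD on the lateral source position can be illustrated with the classic spherical-head approximation commonly attributed to Woodworth. This formula is not taken from the chapter; the head radius of 8.75 cm, the speed of sound of 343 m/s, and the function name are assumptions, and the model only holds for distant sources in the horizontal plane.

```python
import numpy as np

def woodworth_itd(azimuth_deg, head_radius=0.0875, c=343.0):
    """Approximate ITD in seconds for a far source, azimuth between 0 and 90 degrees."""
    theta = np.radians(azimuth_deg)
    return head_radius / c * (theta + np.sin(theta))

itd_front = woodworth_itd(0.0)    # no interaural delay for a frontal source
itd_side = woodworth_itd(90.0)    # roughly 0.66 ms for a fully lateral source
```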

An example of a binaural processing pipeline with headphone reproduction is 
shown in Fig. 5.11. The processed binaural signals are stored or directly streamed 
to the listener whereby the signals are selected and/or processed according to the 
current position and head orientation of the listener. In any case, a physically cor- 
rect simulation requires compensating the recording and reproduction equipment 
(loudspeakers, microphones, headphones) to assure an unaltered reproduction of the 
binaural signals. These compensation filters are usually separated for signal acquisi- 
tion and reproduction to maximize the flexibility of the pipeline. For the same reason, 
anechoic or dry audio content is often convolved with acquired binaural impulse 
responses, which makes it possible to change the audio content, without changing 
the stored binaural signals. The next sections detail the blocks of the introduced 
reproduction pipeline one by one. 
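
The separation of audio content and binaural impulse responses can be sketched in a few lines. The impulse responses below are toy data (a direct path to the left ear, a delayed and attenuated path to the right ear) and the function name is made up; measured or simulated BRIRs would be used in practice.

```python
import numpy as np

def render_binaural(dry, brir_left, brir_right):
    """Convolve dry mono content with a pair of binaural impulse responses."""
    return np.stack([np.convolve(dry, brir_left), np.convolve(dry, brir_right)])

# Toy impulse responses, purely for illustration.
brir_left = np.array([1.0, 0.0, 0.0, 0.0])
brir_right = np.array([0.0, 0.0, 0.0, 0.5])

dry = np.array([1.0, -1.0, 0.5])
ears = render_binaural(dry, brir_left, brir_right)   # shape: (2, 6)
```

Exchanging `dry` changes the content without touching the stored impulse responses, which is the flexibility referred to above.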


[Figure 5.11: block diagram with the stages acquisition (measure or simulate), process, 
store and/or stream, reproduce] 

Fig. 5.11 Example of a headphone-based pipeline for binaural synthesis. Dashed lines indicate 
acoustic signals; black lines indicate digital signals; gray lines indicate movements in 6DoF. H_c 
denote compensation filters for the recording (yellow) and reproduction equipment (red) (CC-BY, 
Fabian Brinkmann) 
5.3.2.1 Signal Acquisition and Processing 


The most basic technique is to directly record sound events, for example a concert, 
with a dummy head, i.e., a replica of a human head (and torso) that is equipped with 
microphones at the positions of the ear canal entrances or inside artificial ear 
canals. This requires a straightforward compensation of the recording microphones by 
means of an inverse filter, whereas the sources are considered to be a part of the scene 
and thus remain uncompensated. This approach is, however, very inflexible because 
the position and orientation of the listener and sources cannot be changed during 
reproduction. It is thus more common to measure or simulate spherical sets of 
head-related impulse responses (HRIRs) that describe the sound propagation between a 
free-field sound source and the listener’s ears (cf. [88, Chaps. 2 and 4] and Fig. 5.12). 
In this case, the sound source has to be compensated as well. The gain in flexibility 
stems from the possibility to use anechoic or dry audio content and select the HRIR 
according to the current source and head position of the listener. While HRIRs are 
not often directly used because anechoic listening conditions are unrealistic for most 
applications, they are essential for room acoustic simulations [80]. Acoustic 
simulations can be used to obtain binaural room impulse responses (BRIRs) that describe 
the sound propagation between a sound source in a reverberant environment and the 
listener’s ears. BRIRs can also be measured, thereby increasing the degree of 
realism at the cost of increasing the effort to measure BRIRs for multiple positions and 
orientations of the listener to enable listener movements during playback. 


Fig. 5.12 HRIR measurement system at the Technical University of Berlin with details of the 
positioning procedure using cross line lasers. During the measurement, the subjects are wearing in-ear 
microphones, are sitting on the chair in the center of the loudspeaker array, and are continuously 
rotated to measure a full spherical HRIR data set. In addition, the wire frames on the floor are 
covered with absorbing material (CC-BY, Fabian Brinkmann) 
5.3.2.2 Head Tracking 


Tracking the head position of the listener is required for dynamic binaural reproduc- 
tion, i.e., a reproduction that accounts for movements of the listener by providing 
binaural signals according to the angle and distance between the source and the lis- 
tener’s head. While it will be sufficient for some applications to only track the head 
orientation, the general VR/AR case requires six degrees of freedom (6DoF, i.e., 
translation and rotation in x, y, and z). 

In general, two tracking approaches exist. Relative tracking systems track the 
position of the listener with respect to a potentially unknown starting point, while 
absolute tracking systems establish a world coordinate system within which the 
absolute position of the listener is tracked. Relative systems usually use inertial 
measurement units (IMU) to derive the listener position from combined sensing of 
a gyroscope, an accelerometer, and possibly a magnetometer. Absolute systems can 
use optical tracking by deriving the listener position from images of a single or 
multiple (infrared) cameras, or GPS data. 

Artifact-free rendering requires a tracking precision of 1° and 1 cm [32, 46], and 
a total system latency of at most about 50 ms [45]. Note that a significantly lower latency 
of about 15 ms is required for rendering visual stimuli in AR applications [39]. A 
challenge for relative tracking systems is to control long-term drift of the IMU, 
while visual occlusion is problematic for optical absolute tracking systems. 


5.3.2.3 Reproduction with Headphones 


Headphone reproduction requires a compensation of the headphone transfer function 
(HpTF) by means of an inverse filter to deliver the binaural signals to the listener’s ear 
without introducing additional coloration. However, the design of the inverse filter is 
not straightforward. Two aspects are problematic. First, the HpTF considerably varies 
across listeners and headphone models, which may require the use of listener and 
model-specific compensation filters depending on the demands of the application. 
Second, the low-frequency response and the center frequency and depth of high- 
frequency notches in the HpTF strongly depend on the fit of the headphone and may 
considerably change if the listener re-positions the headphones (cf. Fig. 5.13). To 
account for the variance, the average HpTF can be used to design the inverse filter, 
and the filter gain at low and high frequencies can be restricted using regularized 
inversion [24, 46]. Once calculated, the static headphone filter can be applied to the 
binaural signals by means of convolution. 
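
The effect of regularization on the inverse filter can be sketched in the frequency domain. The example assumes the widely used form of a regularized inverse, conj(H) / (|H|^2 + beta), which may differ in detail from the designs in [24, 46]. The five-bin HpTF with a deep notch, the function name, and the value of beta are illustrative assumptions.

```python
import numpy as np

def regularized_inverse(hptf_measurements, beta):
    """Regularized inverse of the average HpTF.

    hptf_measurements: complex spectra, shape (n_measurements, n_bins),
                       e.g. from repeated re-positioning of the headphones.
    beta: scalar or per-bin array; larger values limit the filter gain more.
    """
    h_avg = np.mean(hptf_measurements, axis=0)
    return np.conj(h_avg) / (np.abs(h_avg) ** 2 + beta)

# Toy HpTF with a deep notch in the third bin (magnitude 0.01, i.e. -40 dB).
hptf = np.array([1.0, 0.5, 0.01, 2.0, 1.0], dtype=complex)
measurements = np.stack([hptf, hptf])

exact = regularized_inverse(measurements, 0.0)      # plain inversion
limited = regularized_inverse(measurements, 0.01)   # notch is not fully boosted
```

Without regularization the notch would be boosted by 40 dB, which becomes audible as ringing or coloration as soon as the headphone fit changes. With `beta = 0.01` the gain of the inverse filter is bounded by `1 / (2 * sqrt(beta))`, i.e., by a factor of 5.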

In addition to this static convolution, a dynamic convolution is often required to 
render the current HRIR or BRIR. Since real-time audio processing works on blocks 
of audio, this is simply achieved by using the current HRIR as long as the listener 
does not move. If the listener moves, the past and current HRIR are both convolved 
simultaneously and a cross fade with the length of one audio block is applied between 
the two [82]. 
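
The block-wise procedure can be sketched as follows for a single ear signal. This is an illustration under simplifying assumptions rather than the implementation of [82]: the convolution is done in the time domain with overlap-add, the cross fade is linear, and the filter tail that extends beyond the block is taken entirely from the current HRIR. The function name is made up.

```python
import numpy as np

def render_blocks(signal, hrir_per_block, block_size):
    """Block-wise convolution with a cross fade whenever the HRIR changes."""
    n_taps = len(hrir_per_block[0])
    n_blocks = len(signal) // block_size
    out = np.zeros(n_blocks * block_size + n_taps - 1)
    # Fade in over one block; the tail after the block uses the current HRIR only.
    fade_in = np.concatenate([np.linspace(0.0, 1.0, block_size), np.ones(n_taps - 1)])
    previous = hrir_per_block[0]
    for b in range(n_blocks):
        start = b * block_size
        block = signal[start:start + block_size]
        current = hrir_per_block[b]
        y = np.convolve(block, current)
        if not np.array_equal(current, previous):
            y_old = np.convolve(block, previous)
            y = fade_in * y + (1.0 - fade_in) * y_old
        out[start:start + len(y)] += y           # overlap-add of the filter tails
        previous = current
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(8 * 64)
h_a, h_b = rng.standard_normal(32), rng.standard_normal(32)
static = render_blocks(x, [h_a] * 8, 64)               # listener does not move
moving = render_blocks(x, [h_a] * 4 + [h_b] * 4, 64)   # HRIR changes in block 5
```

As long as the HRIR stays the same, the result is identical to an ordinary convolution of the entire signal. Only the block in which the HRIR changes is processed twice.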
5.3.2.4 Reproduction with Loudspeakers 


While delivering binaural signals through headphones is the most obvious solution 
due to the one-to-one correspondence between the two ears and two speakers of the 
headphone, two approaches for transaural reproduction using loudspeakers are also 
available. 

The first approach uses only two loudspeakers. In analogy to headphone reproduc- 
tion, there is a one-to-one correspondence between the ear signals and speakers, and 
the filter for the left loudspeaker compensates for the transfer function between the 
speaker and the left ear. In contrast to headphone reproduction, however, this requires 
an additional filter for cross-talk cancellation (CTC) between the right speaker and 
the left ear (the filters for the right ear work accordingly). This requires an iterative 
design of the compensation filters for all possible positions of the head with respect 
to the loudspeakers and thus a dynamic convolution already for the compensation 
filters [51]. Optionally, more loudspeakers can be used to optimize the system for 
different listening positions or frequency ranges. 
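
For a single, fixed head position, the combined compensation and cross-talk cancellation can be sketched as a regularized inversion of the 2 x 2 matrix of transfer functions between loudspeakers and ears in each frequency bin. This is a textbook formulation and not necessarily the design used in [51]; the toy transfer functions, the function name, and the regularization value are assumptions.

```python
import numpy as np

def ctc_filters(H, beta=1e-3):
    """Regularized cross-talk cancellation filters per frequency bin.

    H: complex array of shape (n_bins, 2, 2) with H[k, ear, speaker].
    Returns C of shape (n_bins, 2, 2) with C[k, speaker, ear_signal],
    computed as (H^H H + beta I)^-1 H^H.
    """
    Hh = np.conj(np.swapaxes(H, -1, -2))
    return np.linalg.solve(Hh @ H + beta * np.eye(2), Hh)

# Toy plant: unit gain to the ipsilateral ear, attenuated and delayed cross-talk.
phases = np.linspace(0.0, np.pi, 8)
H = np.empty((8, 2, 2), dtype=complex)
H[:, 0, 0] = H[:, 1, 1] = 1.0
H[:, 0, 1] = H[:, 1, 0] = 0.4 * np.exp(-1j * phases)

C = ctc_filters(H, beta=1e-9)
ear_response = H @ C   # ideally the identity matrix in every bin
```

In a dynamic system, such filters would have to be recomputed or looked up whenever the listener moves, which is why the compensation itself requires a dynamic convolution.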

The second approach uses linear or circular loudspeaker arrays. Here, the idea is 
to shoot two narrow audio beams in the direction of the listener’s ears. Because the 
beams concentrate most of their energy towards the listener’s ears, a high separa- 
tion between the left and right ear beams can be achieved depending on the array 
geometry [20]. Here, a one-to-one correspondence is established between the 
two beams and the ears, and cross-talk compensation is not required if the beams 
are sufficiently narrow. A dynamic convolution is nevertheless required to update the 
beamformers according to the listener’s position. 


5.3.3 Binaural Reproduction of Synthesized Sound Fields 


It is worth noting that SFS approaches can be combined with binaural reproduction, 
either by virtualizing the loudspeaker array with an array of HRIRs or through 
binaural processing stages that build upon the sound field analysis (cf. [2], [64, 
Sect. 6.4.2] and [89, Sect. 4.11]). This makes binaural reproduction the prime 
framework for rendering spatial audio in AR/VR and SFS a versatile tool within the 
framework: First, SFS makes it possible to efficiently render binaural signals for arbitrary 
head orientations from a single SRIR (this might require pre-processing to achieve a 
reasonable quality as detailed in Sect. 5.4.3). Second, SFS makes it possible to include 
listener movements (translation)—to a limited extent—and thus enables rendering 
with 6DoF. The realization of 6DoF rendering depends on the sound field representa- 
tion, which strongly differs across SFS approaches. However, the general idea agrees 
in many cases. Head rotations can be realized by an inverse rotation of the sound 
field. For perceptually motivated SFS methods, translation can be realized by manip- 
ulating the directions and times of arrival that were obtained through SFA according 
to the listener’s movements (e.g., [41]). The possibility of realizing translation with 
physically motivated SFS approaches and measured sound fields is, however, rather 
limited as this would require arrays with hundreds if not thousands of microphones. 
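
The inverse rotation of the sound field can be illustrated for the simplest case of a first-order Ambisonics signal and a head rotation about the vertical axis. The channel convention (W, X, Y, Z with x to the front and y to the left, positive yaw meaning a turn to the left) and the function names are assumptions; higher orders and arbitrary rotations require full rotation matrices for the spherical harmonics.

```python
import numpy as np

def encode_horizontal(signal, azimuth):
    """First-order encoding (W, X, Y, Z) of a plane wave in the horizontal plane."""
    return np.stack([signal, signal * np.cos(azimuth),
                     signal * np.sin(azimuth), np.zeros_like(signal)])

def counter_rotate_yaw(bformat, head_yaw):
    """Rotate the sound field by the negative head yaw (radians)."""
    w, x, y, z = bformat
    c, s = np.cos(-head_yaw), np.sin(-head_yaw)
    return np.stack([w, c * x - s * y, s * x + c * y, z])

sig = np.random.default_rng(2).standard_normal(256)
field = encode_horizontal(sig, np.radians(30))         # source 30 degrees to the left
rotated = counter_rotate_yaw(field, np.radians(30))    # head turned towards the source
```

After the head has turned towards the source, the rotated field is identical to that of a frontal source, which is then rendered with the HRIRs for the frontal direction.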


5.4 System Performance 


This section details the quality that can be achieved with the different reproduction 
paradigms, starting with binaural synthesis. This is the most common approach, and 
in case it is used in combination with SFS, it also limits the maximally achievable 
quality of the SFS. 


5.4.1 Binaural Synthesis 


The authenticity and plausibility of a reproduction system are without a doubt the 
most integral and comprehensive quality measures and are thus discussed first. How- 
ever, it is also important to shed light on the relevance of individual components in 
the reproduction pipeline. While there are many small pieces that contribute to the 
overall quality, the most relevant might be the individualization of binaural signals, 
head tracking, and audiovisual stimulation, which are discussed separately. 


5.4.1.1 Authenticity and Plausibility 


Headphone-based individual dynamic binaural synthesis can be authentic if reverber- 
ant environments and real-life signals, such as speech, are simulated. For this typical 
use case, 66% of the subjects in Brinkmann et al. [8] could not hear any differ- 
ences between a real loudspeaker and its binaural simulation (cf. Fig. 5.14, bottom). 
However, differences such as coloration become audible if simulating anechoic envi- 
ronments or artificial noise signals. Remaining differences stem from accumulated 
measurement errors in the range of 1 dB mostly related to the positioning of the 
subject and the in-ear microphones during the experiment (cf. Fig. 5.3, top). Clearly, 
these differences can be detected more easily with steady broadband signals such 
as noise. The effect of reverberance might be twofold. First, the reverberation might 
be able to mask audible coloration in the direct sound, and second, reverberant parts 
of the BRIR might be less prone to coloration artifacts because measurement errors 
could cancel across reflections arriving from multiple directions. 

Loudspeaker-based individual binaural synthesis by means of CTC can be authen- 
tic in anechoic reproduction rooms [59]. However, the quality drastically decreases 
if the CTC system is set up in reverberant environments, thus limiting the usability 
of this approach. The decrease in quality is caused by undesired reflections from the 
reproduction room that cannot be compensated in practice due to uncertainties in 
the exact position of the listener [69]. 

Non-individual dynamic binaural synthesis is not authentic but can be plausible, 
i.e., matching the listener’s expectation towards the acoustic environment. This means 
that differences between a real sound field and a non-individual simulation are audible 
in a direct comparison, but they are not large enough for the simulation to be detected 
as such in an indirect comparison. Although the plausibility was only shown for 
headphone-based reproduction of reverberant environments [47, 63], it is reasonable to 
assume that this also holds for the simulation of anechoic environments and 
loudspeaker-based reproduction in anechoic environments. Remaining differences between real 
sound fields and binaural simulations are discussed in the following section. 

Fig. 5.14 Results of the test for authenticity. Top: Range of differences between the sound field of the 
real and virtual frontal loudspeakers across head-above-torso orientations. Data was measured at the 
blocked ear canal entrance and is shown as 12th (light blue) and 3rd octave (dark blue) smoothed 
magnitude spectra. Bottom: 2-Alternative Forced Choice detection rates for all participants, two 
audio contents, source positions in front (0°) and to the left (90°), and three different acoustical 
environments (cf. Fig. 5.3). The size of the dots and the numbers next to them indicate how many 
participants scored identical results. Results on or above the dashed line are significantly above 
chance, indicating that differences between simulated and real sound fields were reliably audible. 
50% correct answers denotes guessing (CC-BY, Fabian Brinkmann) 
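
How a significance threshold such as the dashed line in Fig. 5.14 can be obtained is sketched below with a one-sided exact binomial test against the guessing rate of 50%. The number of trials per participant is not given here, so the values of 20 and 10 trials are purely hypothetical, as are the function name and the significance level of 5% without any correction for multiple testing.

```python
from math import comb

def critical_correct(n_trials, alpha=0.05, p_chance=0.5):
    """Smallest number of correct answers that is significantly above chance."""
    for k in range(n_trials + 1):
        # Probability of at least k correct answers under pure guessing.
        tail = sum(
            comb(n_trials, i) * p_chance**i * (1 - p_chance) ** (n_trials - i)
            for i in range(k, n_trials + 1)
        )
        if tail <= alpha:
            return k
    return None

threshold_20 = critical_correct(20)   # 15 of 20 correct, i.e. 75%
threshold_10 = critical_correct(10)   # 9 of 10 correct, i.e. 90%
```

The fewer trials a participant completes, the higher the percentage of correct answers that is needed before a difference can be called reliably audible.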

An example setup for testing authenticity and plausibility is shown in Fig. 5.14. It 
is important to note that authentic simulations can only be achieved under carefully 
controlled laboratory conditions. Otherwise, the placement of the headphones will 
already introduce audible artifacts that would be hard to control in any consumer 
application [61]. It can, however, be assumed that such artifacts are irrelevant for 
the vast majority of VR/AR applications, where plausibility is a sufficient quality 
criterion. 


5.4.1.2 Effect of Individualization 


Binaural signals (binaural recordings, HRIRs, BRIRs) are highly individual, i.e., 
they differ across listeners due to different shapes of the listeners’ ears, heads, and 
bodies. As a consequence, listening to non-individual binaural signals decreases the 
audio quality and can be thought of as listening through someone else’s ears. While 
the decrease in quality could already be seen in the integral measures authenticity 
and plausibility, this section will look at differences in more detail. 

The most discussed degradation caused by non-individual signals is increased 
uncertainty in source localization [57]. Using individual head-related transfer functions 
(HRTFs, the frequency-domain HRIRs), median root mean squared localization 
errors are approximately 27° for the polar angle, which denotes the up/down 
source position, and 15° for the lateral angle, which denotes the left/right position. 
Quadrant errors, which are a measure for front-back and up-down confusions (and 
mixtures thereof), occur in only 4% of the cases. A drastic increase of the quadrant 
error by a factor of 5 to about 20% and the polar error by a factor of 1.5 to about 40° 
can be observed if using non-individual signals. Because source localization in the 
polar dimension relies on high-frequency cues in the binaural signal, the increased 
errors can be attributed to differences in ear shapes, which have the strongest influ- 
ence on binaural signals at high frequencies. The lateral error increases by only 2°. 
In this case, the auditory system exploits interaural cues (ITD, ILD) for localization, 
which stem from the overall head shape. The fact that head shapes differ less 
between listeners than ear shapes explains the relatively small changes in this case. 
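The two polar-angle measures quoted above can be computed from paired target and response angles. The following is a minimal sketch, assuming the common convention that a response counts as a quadrant error when its polar angle deviates from the target by more than 90°, and that the RMS polar error is evaluated over the remaining responses only. The function names and the toy data are illustrative assumptions and are not taken from [57].

```python
import math

def wrap_deg(angle):
    """Wrap an angular difference to the interval [-180, 180) degrees."""
    return (angle + 180.0) % 360.0 - 180.0

def polar_metrics(targets_deg, responses_deg, quadrant_limit=90.0):
    """Return (quadrant error rate in %, local RMS polar error in degrees).

    Assumption: a trial is a quadrant error if |polar error| > quadrant_limit;
    the RMS error is computed over the remaining (local) trials only.
    """
    errors = [wrap_deg(r - t) for t, r in zip(targets_deg, responses_deg)]
    local = [e for e in errors if abs(e) <= quadrant_limit]
    quadrant_rate = 100.0 * (len(errors) - len(local)) / len(errors)
    if local:
        rms = math.sqrt(sum(e * e for e in local) / len(local))
    else:
        rms = float("nan")
    return quadrant_rate, rms

# Toy data: polar angles in degrees (0 = front, 90 = above, 180 = back).
targets = [0.0, 30.0, 60.0, 150.0, 0.0]
responses = [10.0, 20.0, 90.0, 160.0, 170.0]  # last trial: front-back confusion
rate, rms = polar_metrics(targets, responses)
print(f"quadrant errors: {rate:.0f} %, local RMS polar error: {rms:.1f} deg")
```

Separating the two measures keeps a few front-back confusions from dominating the RMS value, which is why both numbers are usually reported side by side.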

Whereas localization might be one of the most important properties of audio 
in virtual acoustic realities, it is by far not the only aspect that degrades due to 
non-individual signals. An extensive qualitative analysis is shown in Fig. 5.15. The 
results were obtained with pulsed pink noise as audio content in a direct comparison 
between a frontal loudspeaker and headphone-based dynamic binaural syntheses 
using the setup shown in Fig. 5.3. Apart from qualities related to the scene geometry 
(localization, externalization, etc.), considerable degradations can also be observed 
for aspects related to the tone color. In sum, this also led to a larger overall difference, 
and subjects rated the non-individual simulation to be less natural and clear than its 
individual counterpart. As a result, the individual simulation was generally preferred 
(attribute liking); however, the presence was not affected. Because the similarity 
between the individual BRIRs and the non-individual BRIRs used in the test depends 
on the listener, the results for non-individual synthesis have considerably higher 
variance (indicated by the interquartile ranges). 


Fig. 5.15 Perceived differences between a real sound field and the individual (blue, left) and 
non-individual (red, right) dynamic binaural simulation thereof. Results are pooled across an anechoic, 
dry, and wet acoustic environment. The horizontal lines show the medians, the boxes the interquartile 
ranges, and the vertical lines the minimum and maximum perceived differences. Scale labels were 
omitted for clarity and can be found in [48] (CC-BY, Fabian Brinkmann) 

Differences for individual binaural synthesis are small compared to non-individual 
synthesis. In this case, noteworthy differences only remain for the tone color. These 
differences stem from measurement uncertainties that arise mostly due to positioning 
inaccuracies of the subjects and in-ear microphones. As mentioned above, these 
differences become inaudible if using speech signals instead of pulsed noise. 


Individualization is not only important for HRIRs and BRIRs but also for the 
headphone compensation (HpC). The examples above either used fully individ- 
ual (individual HRIRs/BRIRs and HpC) or fully non-individual (non-individual 
HRIRs/BRIRs and HpC) simulations. Combinations of these cases were investigated 
by Engel et al. [15] and Gupta et al. [26]. As expected, fully individual simulations 
always have the highest quality, and considerable degradations can be observed if 
using individual signals with a non-individual HpC. If an individual HpC is not 
feasible, differences between individual and non-individual signals were only sig- 
nificant for the source direction but not for the perceived distance, coloration, and 
overall similarity. In any case, at least a non-individual HpC should be used because 
differences are the largest for simulations without HpC. 

Many individualization approaches are available that mitigate the detrimental 
effects of non-individual signals to a certain degree [25]. However, they demand 
additional action from the listener to obtain individual or individualized signals. It 
is thus worth noting—and discussed in the next sections—that head tracking and 
visual stimulation are two means to mitigate some effects that do not require actions 
from the listener. 


5.4.1.3 Effect of Head Tracking 


Without head tracking, the auditory scene will move if the listeners move their head, 
which is a very unnatural behavior for most VR/AR applications. Head-tracked 
dynamic simulations in which the auditory scene remains stable during head move- 
ments have thus become the standard. Besides the general improvement of the sense 
of presence and immersion, this has at least two more benefits. 

First, localization errors for non-individual signals decrease if head tracking is 
enabled [52]. While the lateral localization errors remain largely unaffected, front-back 
confusion completely disappears if the listeners rotate their head by 32° or more 
to the left or right. This can be explained by movement-induced dynamic changes 
in the binaural signals. As listeners move their head to the left, the left ear moves 
away from the source if it is in front, and the right ear moves towards it. Because 
this behavior would be exactly reversed for a source behind the listener, the auditory 
system is able to resolve the front-back confusion through the head motion. Up-down 
confusion can be resolved analogously through head nodding to the left or right. 
Additionally, the elevation error decreases by a third for head rotations of 64° to the 
left or right. This can be explained by the fact that dynamic changes in the binaural 
signals are largest for a frontal source and almost disappear for a source above and 
below the listener. 
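The front-back argument can be illustrated numerically. The sketch below assumes a strongly simplified sine-law model of the interaural time difference (ITD) for a rigid spherical head; the head radius, the speed of sound, and the function name are assumptions made for illustration and do not reproduce the procedure of [52]. It only shows that the same head rotation produces ITD changes of opposite sign for a frontal and a rear source.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius
SPEED_OF_SOUND = 343.0   # assumed speed of sound in m/s

def itd_seconds(source_az_deg, head_yaw_deg):
    """Simplified sine-law ITD; positive values mean the left ear leads.

    Azimuths are counted counter-clockwise (0 = front, 90 = left, 180 = back),
    and head_yaw_deg > 0 is a head rotation to the left.
    """
    relative = math.radians(source_az_deg - head_yaw_deg)
    return 2.0 * HEAD_RADIUS_M / SPEED_OF_SOUND * math.sin(relative)

for yaw in (0.0, 16.0, 32.0):
    front = itd_seconds(0.0, yaw)    # source straight ahead
    back = itd_seconds(180.0, yaw)   # source directly behind
    print(f"yaw {yaw:4.0f} deg: front {front * 1e6:7.1f} us, back {back * 1e6:7.1f} us")
```

Without head rotation both sources yield an ITD of (practically) zero and are thus indistinguishable on this cue alone; once the head turns to the left, the frontal source leads at the right ear and the rear source at the left ear.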

The second benefit pertains to the externalization of non-individual virtual 
sources [31]. While sources to the side are well externalized even with non-individual 
signals, sources to the front and rear were often reported to be perceived as being 
inside the head. The most likely reason for this is that signals for sources close to 
the median plane are similar for the left and right ears. In contrast, the ear signals 
differ in time and level for sources to the side. These differences stem from the spatial 
separation of the ears and the acoustic shadow of the head and might provide 
the auditory system with evidence of the presence of an external source. If listeners 
perform large head rotations to the left and right, dynamic binaural cues are induced 
and the externalization of frontal and rear sources significantly increases. 

Despite the positive effects of head tracking, it has to be kept in mind that listeners 
will not always perform large head movements just because they can. The actual 
benefit might thus often be smaller than reported above. However, dynamic cues 
that are similar to those of head movements can also be induced by a moving source, 
which was shown to have a similarly positive effect on externalization [30]. An effect 
of source movements for localization has not yet been extensively investigated. For 
the case of distance localization, it was already shown that active self-motion is more 
efficient than passive self-motion and source motion [22]. 


5.4.1.4 Effect of Visual Stimulation 


Because VR/AR applications usually provide congruent audiovisual signals, it is 
worth considering the effect of visual stimulation on the audio quality. Interestingly, 
and in contrast to head tracking, visual stimulation can have both positive and negative 
effects. 

Possibly the most important positive aspect is the ventriloquism effect, which 
describes the phenomenon that a fused audiovisual event is perceived at the location 
of the visual stimulus even if the position of the auditory event deviates from 
that of the visual event. Median thresholds below which fusion appears are approx- 
imately 15° in the horizontal plane and 45° in the median plane if presenting a 
realistic stereoscopic 3D video of a talker [29]. Comparing this to localization errors 
reported in Sect. 5.4.1.2, it can be hypothesized that localization errors will drasti- 
cally decrease if not completely disappear even for non-individual binaural synthesis 
due to audiovisual fusion and the ventriloquism effect if a source is visible and in 
the field of view. It has to be kept in mind, however, that the degree of realism of 
the visual stimulation—termed compellingness in [29]—affects the strength of the 
ventriloquism effect. Thus, fusion thresholds can decrease for less realistic visual 
stimulation. 

Quality-degrading effects can occur if the (expected) acoustics of the visually 
presented room do not match the acoustics of the auditorily presented room, an 
effect termed room divergence. This effect is especially relevant for AR applications 
where listeners can naturally explore real audiovisual environments to which artificial 
auditory or audiovisual events are added. However, room divergence can also 
appear in VR applications, for example, due to badly parameterized room acoustic 
simulations. Room divergence has not been extensively researched to date, but it was 
already shown that it can affect distance perception and externalization [23, 83]. 
While degradations with respect to these qualities might as well be mitigated by the 
ventriloquism effect [56], the room divergence might also affect higher level qualities 
such as plausibility and presence. 


5.4.2 Sound Field Synthesis 


The discussion of SFA/SFS is limited to perceptually motivated approaches because 
they are predominantly used in VR/AR applications. In-depth evaluations of phys- 
ically motivated approaches were, for example, conducted by Wierstorf [85] and 
Erbes [17]. 


5.4.2.1 Vector-Based and Ambisonics Panning 


The most important quality factor for loudspeaker-based reproduction approaches is 
the number of loudspeakers L. In case of Ambisonics, there is a strict dependency 
between L and the achievable spatial resolution, which is determined by the so-called 
Ambisonics order N, where (N + 1)² ≤ L. Intuitively, the spatial resolution increases with 
increasing Ambisonics order. For the amplitude panning methods, the fluctuation 
of the perceived source width across source positions (VBAP) and the minimally 
achievable source width that is independent of the source position (MDAP) increase 
with L. 

Both approaches (vector-based and Ambisonics panning) have distinct disadvantages 
at very low orders N ≤ 2, i.e., for arrays consisting of only about four to 
nine loudspeakers. In this case, Ambisonics and MDAP have a rather limited spatial 
resolution and Ambisonics additionally exhibits a dull sound color. For VBAP, on 
the other hand, the source width heavily depends on the position of the virtual source. 
Using state-of-the-art Ambisonics decoders, the differences between the approaches 
decrease at orders N ≥ 3, i.e., for arrays consisting of 16 loudspeakers or more. 
For such arrays, all methods are able to produce virtual sources whose width and 
loudness are independent of the source position. For an in-depth discussion of these 
properties the interested reader is referred to Zotter and Frank [89, Chaps. 1 and 3] 
and Pulkki et al. [64, Chap. 5]. 
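The loudspeaker counts quoted above (four to nine loudspeakers for N ≤ 2, 16 for N = 3) follow from the number of spherical harmonics up to order N, which is (N + 1)². The short sketch below converts between order and loudspeaker count under the assumption that L ≥ (N + 1)² is the only constraint; practical decoders may impose additional requirements on the array layout, and the function names are illustrative.

```python
import math

def min_loudspeakers(order):
    """Smallest loudspeaker count L that satisfies L >= (N + 1)^2."""
    if order < 0:
        raise ValueError("Ambisonics order must be non-negative")
    return (order + 1) ** 2

def max_order(num_loudspeakers):
    """Largest Ambisonics order N with (N + 1)^2 <= L."""
    if num_loudspeakers < 1:
        raise ValueError("at least one loudspeaker is required")
    return math.isqrt(num_loudspeakers) - 1

for n in range(1, 4):
    print(f"order N = {n}: at least {min_loudspeakers(n)} loudspeakers")
```

For example, an array of 12 loudspeakers supports at most order 2 under this assumption, because order 3 would already require 16.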


5.4.2.2 SIRR, SDM, and DirAC 


Different versions of SIRR and DirAC have been proposed over the past years. The 
two most advanced versions are the so-called Virtual Microphone DirAC, which 
improved the rendering of diffuse sound field components over the original DirAC 
version, and higher order DirAC/SIRR, which make it possible to estimate more 
than one directional component for each time frame to improve the rendering of 
challenging acoustic scenes [53, 64, Chaps. 5 and 6]. For an array consisting of 
16 loudspeakers that are set up in acoustically treated environments (anechoic or 
very dry), SIRR and DirAC can achieve a high audio quality of about 80-90% on a 
MUSHRA-like rating scale (cf. Sect. 5.2.1.1). Best results are obtained for idealized 
microphone array signals, i.e., if the SIRR/DirAC input signals are synthetically 
generated instead of recorded with a real microphone. Using a real microphone 
array decreased the audio quality by about 10% on average. 


Fig. 5.16 Perceived differences between a reference and order-limited binaural renderings of 
microphone array recordings. For details refer to [49] (CC-BY, Tim Lübeck) 

Similar audio qualities were obtained for SDM [78] and binaural SDM [2]. The 
latter study showed that binaural SDM has a plausibility score similar to sound 
fields emitted by real loudspeakers. Although the plausibility score differs from the 
definition of plausibility in Sect. 5.2.2.2, it is reasonable to assume that SDM, and 
also SIRR and DirAC, can be plausible but not authentic. 

So far, perceptual evaluations were conducted in acoustically treated listening 
rooms and it is plausible to expect that the quality decreases with an increasing 
degree of reverberation in the listening environment. Moreover, a comprehensive 
comparative evaluation of SIRR and SDM is missing to date and existing studies 
sometimes used test conditions that might have favored one approach over the others. 

SIRR, SDM, and DirAC might be the most common but are by far not the only 
methods for perceptually motivated SFS. Broader overviews are, for example, given 
by Pulkki et al. [64, Chap. 4] and Zotter and Frank [89, Sect. 5.8]. 


5.4.3 Binaural Reproduction of Synthesized Sound Fields 


As mentioned before, SFS approaches can be reproduced via headphones by virtualizing 
the loudspeaker array with a set of HRTFs. The virtualization is uncritical if 
the number of virtual loudspeakers can be freely selected, which often is the case for 
SIRR, SDM, and DirAC. The situation is more difficult, however, for Ambisonics 
signals, which are typically order limited to 1 ≤ N ≤ 7. The challenge in this case 
is to derive an Ambisonics version of the HRTF data set with the same order restriction. 
Without specifically tailored algorithms, an order of N ≈ 35 is required for 
an authentic Ambisonics representation of HRTFs, and simply restricting the order 
causes clearly audible artifacts [3]. 
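The virtualization of a loudspeaker array can be sketched in a few lines: each loudspeaker feed is filtered with the head-related impulse response (HRIR) pair of the loudspeaker direction, and the results are summed per ear. The example below is a minimal, unoptimized sketch in pure Python; the two-tap "HRIRs" are made-up toy values and the function names are assumptions for illustration. A real renderer would use measured HRIRs, fast convolution, and head-tracked filter updates.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binauralize(speaker_signals, hrirs):
    """Sum every loudspeaker feed filtered with its (left, right) HRIR pair.

    speaker_signals: equally long sample lists, one per virtual loudspeaker.
    hrirs: (left_ir, right_ir) tuples of equal length, in the same order.
    """
    n_out = len(speaker_signals[0]) + len(hrirs[0][0]) - 1
    left = [0.0] * n_out
    right = [0.0] * n_out
    for feed, (h_left, h_right) in zip(speaker_signals, hrirs):
        for k, value in enumerate(convolve(feed, h_left)):
            left[k] += value
        for k, value in enumerate(convolve(feed, h_right)):
            right[k] += value
    return left, right

# Toy example: two virtual loudspeakers with made-up 2-tap "HRIRs".
feeds = [[1.0, 0.0], [0.0, 1.0]]
toy_hrirs = [([1.0, 0.5], [0.25, 0.0]),   # virtual loudspeaker on the left
             ([0.25, 0.0], [1.0, 0.5])]   # virtual loudspeaker on the right
left, right = binauralize(feeds, toy_hrirs)
```

The cost grows linearly with the number of virtual loudspeakers, which is one reason why the number of loudspeakers and the Ambisonics order matter for real-time rendering.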

A variety of methods have been proposed to mitigate these artifacts. These comprise 
a global spectral equalization with or without windowing (tapering) of the 
spherical harmonics coefficients or a separate treatment of the HRTF phase by means 
of (frequency-dependent) time alignment or finding an optimal phase that reduces 
errors in the HRTF magnitude [3, 89, Sect. 4.11]. A comparative study of these algorithms 
was conducted by Lübeck et al. [49]. As shown in Fig. 5.16, the differences 
between a reference and binaural renderings are small already for N = 3, at least for 
the best algorithms. 

Another benefit of headphone reproduction is that different reproduction tech- 
niques can be combined to fine-tune the trade-off between perceptual quality and 
computational efficiency. One possible solution is to use HRTFs with a high spatial 
resolution for direct sound rendering (high computational cost, high quality) com- 
bined with Ambisonics-based rendering of reverberant components (cost and quality 
adjustable by means of the SH order) [16]. This exploits the fact that the spatial resolution 
of the auditory system is higher for the direct sound than for reverberant 
components [50]. 


5.5 Conclusion 


Section 5.2 gave an overview of existing quality measures for evaluating 3D audio 
content and it became apparent that the underlying concepts can also be used to assess 
audio quality in audiovisual virtual reality. Good suggestions were made to adapt 
the application of these measures for AR/VR by simplifying the associated rating 
interfaces and/or adapting methods for the statistical analysis. Open questions in this 
field mainly seem to relate to the higher level constructs of QoE and presence. It will 
be interesting to see how these can be measured with less intrusive user interfaces 
or, in the best case, with indirect physiological or psychological measures. If such 
methods were established, it would also be possible to further investigate how 
far these higher level constructs are affected by specific aspects of audio quality. 
Sections 5.3 and 5.4 introduced selected approaches for generating 3D audio for 
AR/VR and reviewed their quality. The current best practice of using non-individual 
binaural synthesis with compensated headphones for audio reproduction can gener- 
ate plausible simulations and can significantly benefit from additional information 
provided by 3D visual content. Recent advances in signal processing fostered the 
combination of SFS and binaural reproduction. This improved the efficiency—a key 
factor for enabling 3D audio rendering in mobile applications—without introducing 
significant quality degradations. One current hot topic in the combination of SFS and 
binaural reproduction is clearly 6DoF rendering. Many algorithms were suggested 
for this; however, their development and, even more so, their perceptual evaluation 
are still under investigation in the majority of cases. The interested reader may have 
a look at recent articles as a starting point for discovering this field (e.g., [4, 41]). 
A second hot topic is the individualization of binaural technology. The effects of 
individualization were discussed and it was shown that this makes it possible to create 
simulations that are perceptually identical to a real sound field. Approaches for 
individualization were, however, not detailed and the interested reader is referred to 
the overview of Guezenoc and Renaud [25]. 

From the user perspective, it is worth noting that an increasing pool of software and 
hardware is available for 3D audio reproduction. State-of-the-art audio processing 
and reproduction methods are available as plug-ins that can easily be integrated into 
the production workflow as well as in toolboxes that can be used for further research 
and product development. This is complemented by VR/AR-ready hardware such as 
microphone arrays as well as head-mounted displays and headphones with built-in 
head trackers. 


Spatial Design Considerations 
for Interactive Audio in Virtual Reality 


Thomas Deacon and Mathieu Barthet 


Abstract Space is a fundamental feature of virtual reality (VR) systems, and more 
generally, human experience. Space is a place where we can produce and transform 
ideas and act to create meaning. It is also an information container. When work- 
ing with sound and space interactions, making VR systems becomes a fundamen- 
tally interdisciplinary endeavour. To support the design of future systems, designers 
need an understanding of spatial design decisions that impact audio practitioners’ 
processes and communication. This chapter proposes a typology of VR interactive 
audio systems, focusing on their function and the role of space in their design. Spa- 
tial categories are proposed to be able to analyse the role of space within existing 
interactive audio VR products. Based on the spatial design considerations explored 
in this chapter, a series of implications for design are offered that future research can 
exploit. 


6.1 Introduction 


Technologies like virtual reality (VR) offer many ways of using space that could 
benefit creative audio production and immersive experience applications. Using VR's 
affordances for embodied interaction and spatial user interfaces, new forms of spatial 
expression can be explored. Running parallel to VR research efforts in sonic interaction 
in virtual environments (SIVE), much of sonic practice exists as applied design, 
either as music making tools [110], experiential products [106], or games [102]. 
Commercial work is influenced by academia, but it is also based on broader pro- 
fessional constituencies and practices not related to sound and music interaction 
design. 

Much of VR design practice is communicated as professional dialogues, such 
as platform or technology best practice guides [120, 121], or reviews of “lessons- 
learned” in industrial settings [105, 122]. Within these professional dialogues, previ- 
ous research, new technological capabilities, and commercial user research are col- 
lected together to inform communities on how to best support users and task domains. 
For the field of SIVE, and sound and music computing (SMC) more broadly, there 
is still work to be done to bridge commercial practice and academic endeavours. 
Despite recent works [6, 77], there is a paucity of design recommendations and anal- 
ysis regarding how to build spaces, interfaces, and spatial interactions with sound. 
For the potential of VR to be unlocked as a creative medium, multi- and interdisciplinary 
work must be undertaken to bring together the disciplines that touch on 
space, interaction, and sound. 

Studying how people make immersive tools, in commercial and academic settings, 
requires a means of framing how spatial design decisions impact users. This brings up 
two problems: what role do commercial artefacts have in broadening research understanding, 
and how is relevant knowledge generated from such products? Objects, 
prototypes, and artefacts create a context for forming new understanding [46]. By 
analysing an artefact design, research can discover (recover and invent) requirements 
to create technological propositions related to domain-specific concerns [82]. This is 
because an artefact collects designers' judgements about specific design spaces [33], 
for instance how to solve interaction problems, and what aspects are of priority to 
users at different points in an activity. However, this means we cannot recover the 
needs of design by directly questioning the users alone. A broader research picture 
is needed, one that integrates action with tools, users, and reflection on devices. 
So, to develop an understanding for future design interventions, research should 
gather diverse data to understand the existing practice and perceived professional 
constituencies.¹ 

Section 6.2 sets out the problem of space in more detail, highlighting important 
contributions to the design of VR sound and music interaction systems. Section 6.2 
also describes the suitability of typologies to spatial analysis for this research. Fol- 
lowing on from this, Sect. 6.3.1 outlines the approach taken to the design review 
and typology, indicating how relevant work was identified, selected, and coded. 
Section 6.3 sets out a typology of interactive audio systems in VR, and presents case 
studies of spatial design in the field. Section 6.5 looks across analyses and offers ways 
to understand the design space of VR for SMC. Based on findings and reflections, 
Sect. 6.6 proposes actionable design outcomes for further research, then Sect. 6.7 
draws the work to a close. 


¹ Prototypes are any representation of a design idea, regardless of medium, and an artefact is a 
product or interactive system created for a design intervention/experiment [46]. 


6.2 Background 


6.2.1 Terminology 


This chapter analyses the spatial design of interactive audio systems (IAS) in VR. 
IAS refers to any sound and music computing system that involves human interaction 
that can modify the state of the sound or music system; however, we do not 
review information-only auditory displays or audio-rendering technologies. While 
both auditory displays and rendering technologies do include interactivity in their 
operation, this chapter is interested in the use of interactive sound as the primary 
function in the VR application, rather than when sound is used as an information 
medium or renderer of spatial sounds without interactive feedback beyond head rotation. 
No doubt there are significant overlaps in theory and application that would 
be valuable to explore, but addressing all aspects in one chapter would require a 
different focus. 

The following research areas pertain to spatial interaction with user interfaces 
(UIs): 


• Spatial user interface (SUI): Human-computer interaction (HCI) with 3D or 2D 
UI that is operated through spatial interaction, graphically or otherwise [59]. 

• Three-dimensional user interface (3DUI): A UI that involves 3D interaction [16]. 

• Distributed user interface (DUI): UIs that are distributed across devices, users, or 
spatial access points [89]. 


There are also many terms to describe virtual spaces used for sound and music; 
in particular, this research is concerned with immersive VR technology, following 
the definition provided in [6]: 


• Virtual: to be a virtual reality, the reality must be simulated (e.g. computer-generated). 

• Immersive: to be a virtual reality, the reality must give its users the sensation of 
being surrounded by a world. 

• Interactive: to be a virtual reality, the reality must allow its users to affect the 
reality in some meaningful way. 


The term VR can refer both to the hardware systems for delivering immersive experiences 
and to the immersive experiences themselves. Hardware systems range from 
commercial head-mounted display (HMD) technology, such as Oculus or HTC Vive, 
through to complex stereographic projection-based Cave Automatic Virtual Environments 
(CAVEs) [12]. The key thing is that in these immersive environments the visual 
system and interaction capacities are mediated through technological means. In the 
case of social virtual reality (SVR), described in Chap. 8 of this volume, commu- 
nication layers (speech, posture, and gesture) may or may not be mediated through 
technological means. For instance, co-located users may share a virtual world via 
HMDs while their speech communication is unmediated, whereas remote SVR users' 
communication could be completely mediated by avatar representations and voice over 
internet protocol (VoIP) technology. 


6.2.2 Standing on the Shoulders of Giants, but Which Ones?! 


SMC and SIVE are linked to the larger research field of HCI, so it is common practice 
to adopt HCI research findings on how best to design systems. Below, Sect. 6.2.2.1 
describes two examples of how interaction methods are used in the design of VR 
for IAS. But as research in VR for SMC has developed, researchers have needed to 
define and collect design principles specific to sound and music in VR; this work is 
reviewed in Sect. 6.2.2.2. 


6.2.2.1 Adapting Existing VR HCI Frameworks to Audio System 
Design 


To establish a dialogue around spatial considerations, there is a need to adopt find- 
ings from other VR HCI disciplines. But as with the adoption of HCI evaluation 
frameworks within new interfaces for musical expression (NIME) [78, 91, 98], crit- 
ical understanding of the target domain (SMC) needs to be established [70, 81]. For 
instance, making expressive systems for musical creation or sonic experiences has 
different design requirements than usability engineering [42], or demonstrations of 
interaction techniques [8]. This is not to say that usability engineering is not impor- 
tant, but rather the goal of design and evaluation needs to expand to include sonic 
aesthetic qualities for audio-first spatial scenarios. 


Selection and Manipulation Techniques 


Object selection and manipulation is fundamental to VR environments where users 
perform spatial tasks [52]. At a basic level, there are two main categories that describe 
3D interaction for VR: direct and indirect interaction techniques [5]. Object manipulation 
examples of direct and indirect techniques can be seen in Fig. 6.1. Direct 
interaction refers to having ‘virtual hands’, similar to touching and grabbing objects 
in the real world. A benefit of direct interaction is that control maps virtual tasks identically 
with real tasks, resulting in more natural interaction [5]. Indirect interaction 
refers to virtual pointing, like using a laser pointer (ray-casting) that can pick up and 
drop objects in space. Indirect interaction lets users select objects beyond their area 
of reach and requires relatively less physical movement. Overcoming the physical 
constraints of the real world provides substantial benefits for the design of virtual 
spaces, as the arrangement of elements can expand beyond body-scaled interaction. 
Across both direct and indirect mechanics, interaction should be rapid, accurate, 
error proof, easy to understand and control, and aim for low levels of fatigue [5]. 
Depending on how they are designed, both direct and indirect interactions enable 
spatial transformations of objects, including rotation, scaling, and translation. 


Fig. 6.1 Selection and manipulation mechanics in VR: (a) direct interaction, (b) indirect interaction 
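The difference between the two selection mechanics can be made concrete with a small geometric sketch. It assumes that selectable objects are approximated by bounding spheres, that direct selection succeeds when the hand position lies inside the sphere, and that indirect selection succeeds when a pointer ray intersects it. All names are hypothetical and not tied to any particular VR toolkit.

```python
import math

def direct_select(hand_pos, obj_pos, obj_radius):
    """Direct ('virtual hand') selection: the hand must be inside the object."""
    return math.dist(hand_pos, obj_pos) <= obj_radius

def ray_select(origin, direction, obj_pos, obj_radius):
    """Indirect (ray-casting) selection: does the pointer ray hit the sphere?"""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    to_obj = [p - o for p, o in zip(obj_pos, origin)]
    along = sum(t * u for t, u in zip(to_obj, unit))  # projection onto the ray
    if along < 0.0:
        return False  # object centre lies behind the pointer
    closest_sq = sum(t * t for t in to_obj) - along * along
    return closest_sq <= obj_radius ** 2

hand = (0.0, 1.0, 0.0)
sound_object = (0.0, 1.0, 3.0)   # three metres in front of the hand
print(direct_select(hand, sound_object, 0.2))               # out of reach
print(ray_select(hand, (0.0, 0.0, 1.0), sound_object, 0.2)) # pointing at it
```

The sketch reflects the trade-off described above: the ray reaches an object that is far outside arm's length, at the price of a less body-like mapping between hand movement and object manipulation.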

In adapting this research to sound and music interfaces, we must ask how techniques 
impact musical processes and practices. For example, [13] describes the trade-offs 
designers make when picking different control systems for virtual reality musical 
instruments (VRMIs). Work that has received less attention in SMC includes how 
to design for some of the unique properties of VR media. The affordances of VR 
expand into non-real interaction, so there is a fuzzy middle ground between direct 
and indirect interaction. For instance, the Go-Go technique enlarges a user’s limbs 
to be able to ‘touch’ distal objects [74]. In broader VR research, techniques like 
the Go-Go are described under the term homuncular flexibility [93]; the ability to 
augment proprioceptive perception of action capacity in VR, adapting interaction to 
include novel bodies that have extra appendages or appendages capable of atypical 
movements. An example of this type of research into IAS can be found in [27], 
where magical indirect interaction was implemented to have audio control objects 
float towards the user based on pinch actions (via Leap Motion sensor attached to 
the HMD). 
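The non-linear reach extension behind the Go-Go technique can be sketched as follows. The example assumes the piecewise mapping usually attributed to the technique: within a threshold distance from the body the virtual hand follows the real hand one-to-one, and beyond it the virtual reach grows quadratically. The threshold and gain values below are illustrative assumptions and are not taken from [74].

```python
def go_go_reach(real_dist, threshold=0.4, gain=10.0):
    """Map the real hand distance (m) to the virtual hand distance (m).

    Assumed mapping: identity below `threshold`, quadratic growth above it.
    Both parameters are hypothetical tuning values.
    """
    if real_dist < threshold:
        return real_dist
    return real_dist + gain * (real_dist - threshold) ** 2

for d in (0.2, 0.4, 0.5, 0.6, 0.7):
    print(f"real {d:.1f} m -> virtual {go_go_reach(d):.2f} m")
```

Because the mapping is continuous at the threshold, nearby objects are handled like in direct interaction, while a fully stretched arm can reach objects that would otherwise require ray-casting.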


User Interface Elements 


Reviewing 3DUI for immersive music production interfaces, [11] proposes three 
categories of representation for sound processes and parameters: virtual sensors like 
buttons and sliders, dynamic/reactive widgets, and spatial structures; Fig. 6.2 provides 
examples. These different representation categories provide a set of design templates 
for audio production SUIs. For instance, fine-grained individual parameter control 
may be better suited to sensor devices with precise control relationships, whereas, if 
spatio-visual feedback is required about an audio process being applied, a dynamic 
widget is a suitable device to explore. Spatial structures can be used to represent 
sequencers and relationships between parameters; as Sect. 6.4 indicates later, several 
VR audio systems use these to represent either modular synthesis units or whole 
musical sequencers. 


6.2.2.2 Audio-Specific Design Frameworks 


Design for IASs in VR is a developing field, surfacing the potential for new forms 
of sound and music experience [20]. But the opportunities and constraints of VR 
require critical analysis. For instance, embodied interfaces may offer benefits in 
productivity and creative expression [62], but it is not yet known whether embodied 
interfaces in VR yield the same effects. Alongside this gap, there are gaps in 
design understanding, with only a few design frameworks addressing how to create 
VR interfaces and interactions for sound and music [6, 11, 77]. Across these works, a 
deep level of design analysis around the fundamentals of perception, technology, and 
action is prevalent. But design knowledge that helps designers conceptualise space, 
and construct audio interactions and experiences within it, is limited. Below is a 
review of the spatial aspects implicated in the design guidelines of existing VR music 
system research. 

Reviewing VRMI case studies, Serafin et al. outline nine principles to guide 
design, focusing on immersive visualisation from the performer’s viewpoint [77]. The 
principles direct design attention to levels of abstraction, immersion, and imagination. 
Their review of works features many examples of hybrid virtual-physical systems 
and also highlights that VRMIs are well suited to multi-process instruments given SUI 
affordances. Regarding system design, their principles offer robust advice for musical 
performance, but there is a lack of detail on how to go about designing different types 
of spaces and interactions. For instance, within the principles, an emphasis is put on 
making experiences social, but no guidance is provided on the design or evaluation 
of social experience in VR. However, aspects of the case studies do draw attention 
to spatial factors: menu design can ‘cloud’ the performance space; in large 
interfaces, the mixture of control device and interface design means arm movements 
and travel distances can be tiring; and the inclusion of physical control systems 
supports natural, body-based interaction. 

(a) Buttons and Sliders (1D & 2D) (b) Dynamic Widget (c) Spatial Structures 

Fig. 6.2 Types of spatial UI for sound processes. Images from the Leap Motion VR UI design sprint, 
reproduced with permission from the owner, Ultraleap Limited 

Addressing Artful Design for VR sound interaction, Atherton and Wang describe 
a series of design lenses with subordinate principles using case study analysis [6]. 
Their work focuses on the idea of creating totally immersive sonic VR. A central 
concept of their work is the difference between designing for doing as distinct from 
being in VR: “doing is taking action with a purpose; intentionally acting to achieve 
an intended outcome. In contrast, we define being as the manner in which we inhabit 
the world around us” [6]. Expanding on the suggestion in [77] to exploit the ‘magical’ 
opportunities of VR, Atherton and Wang highlight that designers should experiment 
with virtual physics, scale and user perspective, and time. However, these seem to 
be general principles for VR interaction rather than sound-specific opportunities. 
Within their discussions spatial concepts emerge; for instance, designers can phase 
levels of interactivity to create different spaces for action in a scene. An actionable 
design idea relating to this is to guide gaze attention throughout a space in relation to 
narrative elements: to make people stop doing and slow down, put something in 
the sky above them, as it is not an ideal place to work or interact. Atherton and Wang 
highlight that designers need to determine different languages of interaction. Design 
concepts should move beyond functional language towards things that map well 
to sonic expressions; e.g. instead of physical descriptors like speed of movement 
and gravity on an object, an interaction language would use intensity, weight, 
and weightlessness. For Atherton and Wang, play, and particularly social play, is 
a synthesis of doing and being, as it is both an activity and a state. Designers can 
support play by: 

1. lowering users’ inhibitions and encouraging them to play; 
2. engaging users in diverse movement; 
3. allowing users to be silly; 
4. making opportunities for discovery in virtual space. 


Related to play and interaction, on the social level, designers should provide sub- 
spaces within larger worlds and engineer collective interaction scenarios. 


6.2.3 Typologies and Spatial Analysis 


A typology is a classification of individual units within a set of categories that are 
useful for a particular purpose. Typologies support the evaluation of a number of 
different indicators in an integrated manner, based on the identification of relevant 
links or themes. Within architecture, design typologies are a common method of 
spatio-visual analysis [24, 72]. The teaching of architectural systems uses an ordered 
set of types to define areas of interlocking design [22], for instance, in Fig. 6.3 the 
concept of form is described using a series of types and representative examples. 
But typologies can also represent ‘spatial qualities’ regarding interaction, see 
Fig. 6.4 where different creative spaces (meeting rooms, maker spaces) can possess 
positive and negative attributes for certain activities (socially inviting or separating, 
playful or serious) [84]. It is this interpretive layer within a set of similar objects that 
makes typologies a valuable analysis method. We can step out from just the formal 
representation of space and shape and ask: how does this form or behaviour impact 
human needs and experience? 

Compared to a systematic literature review, a design typology includes references 
to artefacts regardless of whether they have received formal user evaluation or 
previous research analysis. The reasoning is that much of the work in 
the VR music field is happening outside academia, so rather than reflecting design 
parameters only within previous academic dialogues, design understanding should 
also be based on practice. 

Compared to a taxonomy, typology is preferred for this work, as the separation of 
types is non-hierarchical and potentially multi-faceted. Classification is done accord- 
ing to structural features, common characteristics, or other forms of patterns across 
instances. Within a typology, there is no implicit or explicit hierarchy connecting 
different research artefacts and products in VR. Also depending on the granularity 
of the type suggested, a single artefact may exist within two types simultaneously. 
Using typologies, themes of significance can be traced across systems. These patterns 
may describe best practices, observe patterns in interaction, explain good designs, 
or capture experience or insight so that other people can reuse these solutions. 


6.3 Design Analysis 


6.3.1 Methodology 


As a formal process the typology was built upon identification, selection, and coding 
of audio-visual virtual spaces. 

Identification: Literature gathering was achieved by parsing VR examples from 
the Musical XR literature dataset. Practice and product examples were gathered 
across the first author’s thesis research period using search engines, internet forums, 
interviews, and social media [25].² 


[Figure panels: Centralized, Linear, Radial, Clustered] 

Fig. 6.3 Example of a spatial typology of form within architecture, adapted from [22] 


2 https://github.com/lucaturchet/Musical_XR_publication_database. 
Space types: 

1. PERSONAL SPACE: allows for concentrated ‘heads-down’ work (thinking, reading, 
writing), deep work, and reflection; requires reduced stimulation to avoid distraction. 
2. COLLABORATION SPACE: is used for group work, workshops, face-to-face discussions, 
client meetings, or student-teacher consultations. 
3. PRESENTATION SPACE: is used to share, present, and consume knowledge, ideas, and 
work results in a one-directional way (presentations or exhibitions). 
4. MAKING SPACE: is used for model making and building; allows experimentation, 
play, noise, and dirt. 
5. INTERMISSION SPACE: connects other space types; is used for breaks, recreation, and 
transfers; includes hallways, stairs, cafeterias, and outdoor areas. 

Spatial qualities: 

A: KNOWLEDGE PROCESSOR: space can store, display, and foster the transfer of information 
and knowledge (tacit, explicit, and embedded knowledge). 
B: INDICATOR OF CULTURE: space suggests a specific behavior, either through common 
sense, written or unwritten rules, rituals, labels, and signs. 
C: PROCESS ENABLER: space can provide specific spatial structures or technical 
infrastructure that might guide or hinder the work process. 
D: SOCIAL DIMENSION: space influences social interactions and facilitates meetings 
and personal exchanges. 
E: SOURCE OF STIMULATION: space can provide certain stimuli (views, sounds, smells, 
textures, materials, etc.). 

Fig. 6.4 Example of a spatial typology within design, taken from [84]. Reprinted from Design 
Studies, 56, Thoring et al., Creative environments for design education and practice: A typology of 
creative spaces, 54-83, Copyright (2018), with permission from Elsevier and Katja Thoring 


Selection: Findings were assessed for relevance to the analysis. Cases were 
included on the basis of the following criteria: (1) Is the system based on immersive 
VR technology via an HMD? (2) Is the primary function or design intention of the 
artefact related to sound or music? 

Coding: A combination of deductive and inductive thematic coding was undertaken, 
based upon thematic analysis [17]. An inductive approach involves allowing the 
data to determine the themes, whereas a deductive approach involves coming to 
the data with some preconceived themes that are expected to be reflected there, based 
on theory or existing knowledge. For this research, the deductive element was the 
setting of top-level coding categories (UI, Space Use, Social Engagement, Skill 
Level, Interactions) that probe how a VR IAS was constructed; the questions used 
are available in Table 6.1. The inductive coding reflects themes within the deductive 
categories based on the interface designs. Coding sources involved: use of the 
VR system where possible; review of online video sources; analysis of images; and 
review of documentation and published literature. In each activity, notes and open 
coding were undertaken on system design using qualitative data analysis software. 
Table 6.1 Coding system developed for typology. Top-level codes are the deductive code 
categories; indented codes are the inductive themes 

Code                  Description 

UI                    What are the types of UI exploited in the VR interface? 
  Screen-like         3D or 2D UI is used in VR that behaves like a standard screen 
                      menu or workspace 
  3D Objects          3D UI is used for information and action 
  None                No conventional UI or SUI is provided to users, such as an open 
                      world terrain or an external musical controller 
  Physical            No functional UI or SUI is offered inside the VE, but an external 
                      hardware musical controller is used 

Space Use             How is space used in this device? 
  Sonic               The positions of people or objects in space have an impact on sound 
                      processing; space is a functional element of the sound design 
                      process 
  Visual              Interactive visual feedback is provided based on the positions or 
                      orientation in space of people or objects 

Social Engagement     What number of users was the application designed to support? 
  Solo                Single-user spaces with no intelligent-agent interaction 
  Collaborative       Multi-user or single/multi-user with intelligent-agent interactions 
  Collective          Massive multiplayer environments, both human and agent-based 

Skill Level           Was the system designed for novices, experts or both? 
  Novice 
  Expert 
  NA                  No formal user study conducted 

Interactions          What is the flow of action and the related system response? 
  Sonic-visual        Coupling between sound and visual features, where sound changes 
                      visual features 
  Visual-sonic        Coupling between visual and auditory information, where the 
                      visual information changes sound properties 
  Sonic-sonic         Audio input used to control system features that relate to sound 

After this, the deductive sweep was undertaken where the sources, open-codings and 
notes were reviewed in the context of each deductive category, and this resulted in 
the inductive themes that can be found in Table 6.1. 
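
The coding scheme can also be held as a simple data structure, which makes it straightforward to check that an artefact's coding only uses themes defined under each deductive category. The following is a minimal sketch under that assumption; the names `CODING_SCHEME` and `validate_coding` and the example artefact are hypothetical and are not part of the original analysis, which used qualitative data analysis software.

```python
# Deductive categories mapped to their inductive themes (taken from Table 6.1).
CODING_SCHEME = {
    "UI": {"Screen-like", "3D Objects", "None", "Physical"},
    "Space Use": {"Sonic", "Visual"},
    "Social Engagement": {"Solo", "Collaborative", "Collective"},
    "Skill Level": {"Novice", "Expert", "NA"},
    "Interactions": {"Sonic-visual", "Visual-sonic", "Sonic-sonic"},
}


def validate_coding(coding):
    """Return a list of problems; an empty list means the coding fits the scheme."""
    problems = []
    for category, themes in coding.items():
        if category not in CODING_SCHEME:
            problems.append(f"unknown category: {category}")
            continue
        # Report any theme that is not defined under this category.
        for theme in sorted(set(themes) - CODING_SCHEME[category]):
            problems.append(f"unknown theme in {category}: {theme}")
    return problems


# Hypothetical artefact: several themes may apply within one category.
example_artefact = {
    "UI": {"3D Objects", "Screen-like"},
    "Space Use": {"Sonic", "Visual"},
    "Social Engagement": {"Solo"},
    "Skill Level": {"NA"},
    "Interactions": {"Sonic-visual"},
}

if __name__ == "__main__":
    print(validate_coding(example_artefact))
```

Using sets of themes rather than a single value per category reflects that a system can exhibit more than one theme at once, for example both screen-like and 3D object UI.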


6.3.2 Typology of Virtual Reality Interactive Audio Systems 


Here a typology of VR IASs is proposed, delineating how different systems function 
overall and how space is used in their design. The referencing of work in this section 
differentiates between commercial products and academic publications, using two 
different reference sections for clarity. The typology is split into two broad categories 
within which VR products and research are discussed: 


1. Type of Experience/Application—here we collate instances of products and 
research by their function as a sound and music system in VR. 

2. Role of Space—in this phase we look across the different types of systems to 
suggest how the design of space can be categorised. 


6.3.2.1 Type of Experience 


Most implementations of interactive VR sound and music systems fall into one or 
several of the categories in the subsequent list. Many cited products have no formal 
user testing results available. 


• Audio-Visual Performance Environment: Audience-oriented systems for playback 
or live performance of compositions with audio-visual interactions [14, 51, 
101, 109]. For audience-oriented systems, interactivity is related to being part of 
a social group of spectators, rather than being able to interact sonically. 

• Augmented Virtuality (AV): A VR HMD acts as a visual output modality alongside 
physical controllers or smart objects, creating an AV system [34, 43, 100]. 
This descriptor excludes augmented reality (AR) technologies, such as HoloLens, 
as the visual overlay effect is considered different to the total re-representation of 
visual stimuli that occurs in VR [99]. 

• Collaborative: Some form of collaborative interaction occurs in the VR audio system 
(human or agent-based). The interaction must directly involve making sound/music 
together [12, 25, 51, 63, 103, 110, 119], rather than more presentational systems 
such as an audience cohabiting with performers in a virtual shared space, which are 
covered by the Audio-Visual Performance Environment category. Examples and design 
considerations are described in Sect. 6.4. 

• Conductor: Controlling audio-visual playback characteristics of a pre-existing 
composition [51, 117]. 

• Control Surface: VR as a visual and interactive element to manipulate the 
functionality of an existing digital audio workstation (DAW), e.g., Reaper [104]. 

• Generative Music System: Partial or total algorithmic music composition, where 
the sound is experienced in VR space, and/or controlled by spatial interaction in 
VR [57, 116]. 

• Learning Interface: VR systems to support the learning of music, either as 
performance tutoring, theory, or general concepts in music such as genre [48]. 

• Music Game: Systems where gameplay is oriented around the player’s interactions 
with a musical score or individual songs. A good example is Beat Saber [102], the 
highest selling VR game of all time at the time of publication. 

• Narrative and Soundscape: Pieces that integrate interactive audio in virtual 
reality [85, 116]. 

• Physics Interaction: Physics-based sonic interaction systems [27, 106]. 
• Sandbox³: Designed like visual programming languages for digital sound 
synthesis, such as Pure Data, Max/MSP, and VCV Rack, these VR sandboxes 
use patching together of modules to create sound [112-114]. 

• Sequencer: Drum and music sequencers in VR. As sequencing is a common feature 
of many musical applications, this category refers to interfaces that are either just 
a sequencer or use sequencing somewhere within their interaction design [27, 63, 
103, 110, 112, 119]. 

• Spatial Audio Controller: Mixer-style control of spatial audio characteristics of 
sources and effects [9, 25, 27, 43, 69, 90, 104]. 

• Sounding Object: Virtual object manipulation with parametric sound output [67, 
68]. 

• Scientific Instrument: VR systems designed to test an audio or interaction 
tool/feature; a good example is a VR-based binaural spatialisation evaluation 
system [35, 73]. 

• VR DAW: Virtual audio environment with multi-process 3D interfaces for the creation 
and manipulation of audio. An important feature is the recording of either audio or 
performance data from real-time interaction. Interface abstraction and control metaphors 
may differ significantly from conventional desktop DAWs [12, 27, 88, 103, 110, 119]. 

• VRMI: Virtual modelling and representations of existing acoustic instruments or 
synthesis methods [9, 12, 19, 31, 34, 51, 56, 61, 66, 68, 71, 80, 110, 114, 118]. 


Overlaps and Contrasts 


Due to the broad design scopes of some systems, an artefact can appear in mul- 
tiple categories, or exist in a space between two categories. For instance, [51] is 
in Audio-Visual Performance Environments, Collaborative, Conductor, and VRMI. 
While [12] is technically a VR DAW, the audio and interaction design concept 
is highly idiosyncratic, so it becomes closer to a VRMI. The following statements 
are intended to clarify any issues regarding overlaps in terminology. 


Sounding Objects vs. Physics Interaction: Both types refer to physics-based 
interactions. Sounding objects are cases where the mesh structures of objects are the 
source of sound generation/control (e.g. scanned synthesis of an elastic mesh), whereas 
physics interactions include collision-based interactions for sound generation or the 
use of physics systems to control single or multiple audio features (e.g. parameters 
or spatialisation). The interested reader might refer to Chap. 2 for more details on 
these topics. 

VRMI vs. Sandbox: While both can refer to synthesis methods, sandboxes are 
specifically modular construction environments, whereas synthesis methods in 
VRMIs would be a closed form of synthesiser e.g. playing a DX7 emulator in 
virtual reality. 
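
Because the typology is non-hierarchical, the overlaps discussed above can be made explicit by inverting the type-to-reference lists from Sect. 6.3.2.1. The sketch below is one possible way of doing so; the reference numbers are copied from the list of types, while the names `TYPES` and `types_by_reference` are illustrative.

```python
from collections import defaultdict

# Reference numbers per type, as cited in the list of experience types.
TYPES = {
    "Audio-Visual Performance Environment": {14, 51, 101, 109},
    "Augmented Virtuality": {34, 43, 100},
    "Collaborative": {12, 25, 51, 63, 103, 110, 119},
    "Conductor": {51, 117},
    "Control Surface": {104},
    "Generative Music System": {57, 116},
    "Learning Interface": {48},
    "Music Game": {102},
    "Narrative and Soundscape": {85, 116},
    "Physics Interaction": {27, 106},
    "Sandbox": {112, 113, 114},
    "Sequencer": {27, 63, 103, 110, 112, 119},
    "Spatial Audio Controller": {9, 25, 27, 43, 69, 90, 104},
    "Sounding Object": {67, 68},
    "Scientific Instrument": {35, 73},
    "VR DAW": {12, 27, 88, 103, 110, 119},
    "VRMI": {9, 12, 19, 31, 34, 51, 56, 61, 66, 68, 71, 80, 110, 114, 118},
}


def types_by_reference(types):
    """Invert the mapping: reference number -> set of type names it appears under."""
    index = defaultdict(set)
    for type_name, refs in types.items():
        for ref in refs:
            index[ref].add(type_name)
    return index


index = types_by_reference(TYPES)

if __name__ == "__main__":
    # Artefacts that sit in more than one type, most overlapping first.
    for ref, names in sorted(index.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if len(names) > 1:
            print(ref, sorted(names))
```

Looking up reference 51 in the resulting index returns the four types named in the text, which is the kind of multi-faceted membership a taxonomy would not allow.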


3 Category name and description sourced from [4]. 
6.3.2.2 Role of Space 


Many of the systems outlined above offer novel interaction methods coupled with 
3D visualisation. Looking at how space is used in VR music and audio systems 
provides a different way to group research and design contributions. For simplicity, 
the following categories are presented as discrete areas, but dimensions would also 
be suitable (i.e. systems could belong to several categories; see [15, 39] for examples 
of dimension-based classification for digital musical instruments (DMIs)). 


Space as a holder of elements for musical input/sonic control: The most dominant 
form of spatial design is to use space as a container for interactive elements 
that either produce sound or control sound in some way. Within this category, 
key differences are whether menu-based SUI is used [103], or more object-based 
3DUI is exploited [12]; this is discussed further in the next section. Other works 
include: [19, 27, 31, 56, 61, 63, 66-69, 71, 88, 100, 104, 109, 110, 112-114, 
118]. 

Space as a medium of sonic experience: In these sorts of systems, space is woven 
into every aspect of user experience or system design. For instance, in [9], the sonic 
operation of the VR system makes no sense if users do not engage in collaborative 
spatial behaviours. In this category, the relationship of spatial interaction to 
system feedback can be predominantly passive, like a recorded soundscape [85], 
or fully interactive, like an audio-visual arts piece that maps spatial input to output 
modalities [90]. In some cases, visual space may only be a supporting medium for 
a spatial sonic experience [85]. It is worth noting that spatial audio controllers 
are not automatically considered part of this category. As spatial audio controllers 
deal with controlling and manipulating elements, they are considered to be part of 
the Space as a holder of elements for musical input/sonic control category. Rather, 
this category holds experiences where spatiality is more intrinsically involved in 
the interaction between elements and user experience, whereas in a controller 
system it is a functional relationship. Other works include: [43, 57, 80, 106]. 

Space as a visual resource to enhance musical performance: In this category, 
space is primarily used for its visual and spatial representation opportunities 
rather than as a direct control system or as an intrinsic part of the sonic 
experience derived from the system. Designers use space as an extra layer to a music 
performance or system; for example, this can be to: 


1. Present performers with enhanced visual feedback related to their playing 
of a musical instrument [34]; 

2. Provide a space for an audience to contribute to a collective experience of 
musical performance [14]; or 

3. Use space as a place for an audience to convene for a music performance in 
VR [101, 109]. 
6.4 Spatial Design Analysis Case Studies 


The state of the art in VR audio production and immersive musical experiences 
includes single-user and collaborative approaches. In the following case studies, 
the spatial and social design decisions are discussed; noting that each of the sys- 
tems serves different purposes as musical experiences. Our motivation is to further 
detail design typology categories, by understanding and comparing the decisions VR 
designers make. Reviews are broken into four areas: single user systems, collabora- 
tive systems, collective systems, and spatial audio production systems. The reason 
we focus on these previous areas, only within immersive music and interactive sound 
production, is so that design comparisons and implications can have some level of 
shared context. We chose the field of immersive music as a point of shared interest 
between academia and industry. But it would be valuable to probe design decisions 
comparatively between broader fields of SIVE design, for instance, auditory display 
and sound production systems; however, this would be a different contribution. 


6.4.1 Single-User Systems 


Figure 6.5 shows the music room [118], an instrument space containing multiple 
VRMIs that are designed to be played with the VR controllers, following a DAW-like 
workflow of perform and record, then arrange and edit. Instruments include a 
drum-kit, laser harp, pedal steel guitar and a chord harp. The spatial setup mimics a 
conventional studio. In Fig. 6.5 we can see spatial 2D graphical user interfaces (GUIs) 
presenting recorded information and menu functions, while 3DUI objects are used 
to represent instruments, and a 360° photograph of a real studio provides the visual 
backdrop. A design decision of the space was to situate all instruments in a circle 
around the user, presumably so that all the VRMIs can be played in a small physical 
space. Two areas are utilised for the UI: action space and display space. The action 
space is for the VRMIs, and the display space, further away from the user, provides 
a conventional GUI. To be able to interact with the distant GUI, laser pointers are 
used. 

Fig. 6.5 Single-user VR spatial design considerations A: music room instrument space, with drum 
kit instrument being used and the recording panel UI visible, displaying previously recorded data 

(a) SoundStage VR for HTC Vive. Sound control devices float in space and are connected by 
wires, emulating a modular synthesis workflow. 
(b) A Mux composition, highlighting the complexity of the workspace. 
(c) LyraVR user editing a 3D parameter cube widget surrounded by other interface elements, 
highlighting the spatial complexity of node-edge musical structures. 
(d) LyraVR: view of 3DUI elements that float in fixed space. 

Fig. 6.6 Single-user VR spatial design considerations B: sandboxes, node-edge structures and 
modular systems 

SoundStage [114] (Fig. 6.6a) and Mux [113] (Fig. 6.6b) are modular instrument-building 
Sandboxes in VR. Users can define their own systems and perform music through 
them. Both are multi-process VRMIs designed for room-scale interaction. 
In these systems, a user surrounds themselves with modules and reactive widgets, and 
‘patches’ them up using VR controllers. While stimulating and highly interactive, 
the resulting virtual spaces can be complex and messy spatial arrangements (author’s 
opinion); Fig. 6.6b shows an example of a sound system made with Mux, highlighting 
the spatial-visual complexity. One possible reason arrangements become complex is 
that spatial organisation is arbitrary and user-defined. A novel spatial feature is 
that speaker scale controls source loudness, and this turns a slider or number UI into 
a 3DUI interaction process. 

LyraVR [112] and Drops [106] are two examples of Sandbox systems 
that build the temporal behaviour of the composition using spatial relationships. 
Figure 6.6c and d show LyraVR, a musical ‘playground’ where users build music 
sequences in space to create audio-visual compositions. The node-based sequencer 
allows the creation of units in free space. Although aimed at single users, such an 
interaction and playback method would be scalable to collaborative systems. Drops is a 
VR ‘rhythm garden’, where a user creates musical patterns using the interaction of 
objects and simulated gravity. The system requires setting up object nodes (‘eggs’) 
that release ‘marbles’ that make a sound when they strike other surfaces; the size 
and release frequency of marbles can be manipulated by the user. By adding more 
surfaces and modifying planes of movement for marbles, the musical composition 
is built using a ‘physical’ design process. In LyraVR, Mux, and SoundStage, users 
interact with sound elements via spatial node-edge structures, and this gains a level 
of immediacy for musical changes at the cost of visual-spatial complexity. But the 
embodied control of temporal musical behaviour via the arbitrary positioning of 3DUI 
elements does create an experimental creative process driven by interaction in space. 
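
To illustrate how a spatial arrangement can determine rhythm in a system like Drops, the sketch below computes when marbles released from an ‘egg’ would strike a surface placed beneath it. It assumes simple free fall from rest with no bouncing or air resistance; the internal physics of Drops is not documented here, and the function name and parameter values are hypothetical.

```python
import math


def impact_times(height_m, release_interval_s, n_marbles, gravity=9.81):
    """Times (s) at which successive marbles hit a surface height_m below the egg.

    Assumes each marble starts at rest and falls freely (an illustrative model).
    """
    fall_time = math.sqrt(2.0 * height_m / gravity)
    return [i * release_interval_s + fall_time for i in range(n_marbles)]


if __name__ == "__main__":
    # Raising the egg delays every onset; changing the interval changes the tempo.
    for height in (0.5, 2.0):
        onsets = impact_times(height, release_interval_s=0.5, n_marbles=4)
        print(height, [round(t, 3) for t in onsets])
```

Under this model, moving a surface up or down shifts the phase of the pattern, while the release frequency sets its rate, so editing the scene geometry is equivalent to editing the sequence.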


6.4.2 Collaborative Systems 


Block Rocking Beats [103], LeMo [63], and Polyadic [25] are collaborative music 
making (CMM) Sequencers. However, the systems have different approaches to 
spatial design for collaborative interaction. LeMo and Polyadic are the only 
collaborative systems in this review that have undergone formal user studies [25, 63, 
64]. 

Block Rocking Beats, Fig. 6.7a and b, enables avatar-based (head and hands only) 
remote collaborative music production in a virtual sound studio for up to three peo- 
ple. The space is modelled like a futuristic studio, adapting a conventional layout 
of production equipment areas and multiple screens. The environment provides a 
sequencer interface for each user while project information is displayed on a single 
large screen within the environment, and this provides some level of shared visual 
information. Additionally, reactive systems alter environment appearance in sync 
with music created. As a spatial layout, users’ positions are fixed in the space, a few 
meters from each other in a semi-circle facing the front screen. The layout limits the 
capacity to view each other’s workspaces and may inhibit forms of mutual monitoring. 
Regarding avatar design, the character design is highly stylised, and the ‘hand’ 
representation is designed like a tapered wand. The taper is designed to enlarge the 
usable sequencer area since, with buttons at a normal scale, a controller-sized hand 
would hit multiple buttons at once. 

(a) Block Rocking Beats collaborative composition environment, two users pictured working in 
their own areas. 
(b) Block Rocking Beats, view of single user display area. 
(c) EXA collaborative composition environment. Freely configurable space for multiple people to 
musically perform and sequence sounds. 
(d) EXA collaborative composition environment. View of a user inputting information with an 
instrument context window open to the side. 
(e) LeMo collaborative composition environment [63-65]. Freely configurable space for multiple 
people to sequence sounds. Transparent SUI used to sequence the sounds, sequencers can be 
minimised into ‘bubbles’ to rearrange space. 
(f) Polyadic collaborative drum sequencer [25] 

Fig. 6.7 Collaborative VR music production interfaces 

LeMo allows two co-located users to engage in avatar-based CMM in VR, using a variety 
of sequencer instruments [63-65]. Depending on the experimental condition, different 
spatial features would be activated, such as private workspace areas and spatially 
reactive loudness. Studies of LeMo evaluated visual and sonic workspace design, 
based on the concept of public and private territory, developing design implications 
for SVR; for detailed findings please consult Chap. 8 of this volume. Setting aside the 
experimental findings, as a spatial design, compared to Block Rocking Beats and 
Polyadic, LeMo allows users to move and rotate their workspaces to accommodate 
social interaction around the task of music making, commonly using face-to-face or 
side-by-side arrangements (see Fig. 6.7e). A novel design feature of note is that SUI 
sequencers can be minimised into ‘bubbles’ to rearrange space. As these sounds are 
spatially located, the bubble acts as both a UI and an audio object. Additionally, the 
inclusion of 3D drawing as a communication medium enables a variety of annotation 
behaviours. As in Block Rocking Beats and Polyadic, avatar design was rudimentary, 
offering a head with gaze direction; however, the use of Leap Motion as the input 
device enables more detailed hand representations. These were used for functional 
input and social communication, e.g. waving and pointing. 
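
The idea of spatially reactive loudness can be sketched as a gain that falls as a listener moves away from a spatially located sequencer. The model below is a generic inverse-distance attenuation with a reference distance, chosen only for illustration; it is an assumption and not the attenuation law used in LeMo, and the function name and positions are hypothetical.

```python
import math


def distance_gain(listener, source, ref_dist=1.0):
    """Linear gain for a source heard from the listener position (inverse-distance law).

    Positions are (x, y, z) in metres; within ref_dist the gain is held at 1.0.
    """
    d = math.dist(listener, source)
    return 1.0 if d <= ref_dist else ref_dist / d


if __name__ == "__main__":
    own_sequencer = (0.5, 0.0, 0.0)
    partner_sequencer = (2.0, 0.0, 0.0)
    me = (0.0, 0.0, 0.0)
    for name, pos in (("own", own_sequencer), ("partner", partner_sequencer)):
        g = distance_gain(me, pos)
        print(name, round(g, 3), round(20 * math.log10(g), 2), "dB")
```

With a rule of this kind, a collaborator's sequencer becomes quieter as users move their workspaces apart, which is one way that spatial arrangement could shape a more private sonic workspace.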

Polyadic enables collaborative composition of drum loops to accompany backing 
tracks for two co-located participants [25]. The system is designed to be instantiated 
in two user interface media, VR and Desktop. The design motivation of Polyadic was 
to compare VR and Desktop media concerning usability, creativity support, and col- 
laboration. In order to create a fair comparison of media, constraints were imposed 
on the design of both media types. This limited the design of features to only use 
control methods that could work equally across both conditions, namely a standard 
step sequencer with per step volume and timing control. In the VR condition, the 
environment uses fixed placement of 3DUI sequencers made up of virtual sensor 
buttons and sliders, see Fig. 6.7f. Low-fidelity avatars were utilised to allow 
rudimentary social cues. Avatars used a sphere head with ‘sunglasses’ to indicate gaze 
direction and two smaller spheres to indicate hands, enabling simple spatial 
referencing. Additionally, each user’s workspace and interface actions were replicated 
within the other user’s environment, enabling referencing and looking at what the 
other is doing. 
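
A step sequencer with per-step volume and timing control, of the kind described for Polyadic, can be represented compactly. The sketch below is a minimal illustration and not Polyadic's implementation; in particular, treating ‘timing control’ as an offset expressed as a fraction of a step is an assumption, and the names `Step` and `schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    active: bool = False
    volume: float = 1.0   # 0.0 (silent) to 1.0 (full level)
    offset: float = 0.0   # assumed timing nudge, as a fraction of one step


def schedule(steps, bpm=120.0, steps_per_beat=4):
    """Return (onset_time_s, volume) pairs for the active steps of one loop."""
    step_dur = 60.0 / bpm / steps_per_beat
    return [((i + s.offset) * step_dur, s.volume)
            for i, s in enumerate(steps) if s.active]


if __name__ == "__main__":
    # Four-step pattern: a full-level hit on step 0, a quieter late hit on step 2.
    pattern = [Step(True), Step(), Step(True, volume=0.5, offset=0.25), Step()]
    print(schedule(pattern))
```

Because each step carries only an on/off state and two continuous values, the same model can be driven by virtual buttons and sliders in VR or by mouse controls on a desktop, which is consistent with the constraint of using control methods that work equally across both conditions.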

EXA [110], Fig. 6.7d, is a collaborative Instrument Space where multiple users 
can compose, record, and perform music using instruments of their own design. EXA 
differs from the previous examples as users input musical sequence information in 
real time using drum-like instruments, rather than pressing sequencer buttons. Once 
sequences are made they can be edited using menus and button presses. Similar to 
LeMo, EXA allows users to freely organise their workspace in line with collaborative 
needs. Also, the custom design of VRMIs introduces idiosyncratic uses of space in 
order to perform each VRMI. Like the other systems, EXA utilises simple head and hands 
avatars. 
6.4.3 Collective Systems 


The following reviews are special cases, social VR platforms designed for musical 
experiences, pictured in Fig. 6.8. As predominantly music visualisations in VR, there 
is limited sonic interactivity for users. So the focus is on how these spaces act as 
collective social experiences in VR. For broader discussion of music visualisation in 
XR, see [92]. While not sound production platforms in themselves, the experience 
of a collective engagement in VR, related to audio-visual performance, is an area 
of immersive entertainment where new production tools and design experience are 
required. 

The WaveVR [101], Fig. 6.8a, is a cross-platform social VR experience, like going 
to a ‘gig’ in VR. Artists can use it to make audio-visual experiences for audiences 
across the world. As a virtual space, the shared focus of a stage is used for most 
performances, but the virtual space is reconfigured for each ‘gig’; similar to different 
theatre performances all taking place on the same stage. In one instance, music 
toy spaces were designed for the audience to interact with musical compositions; 
these took the form of objects that change the level of audio effects based on spatial 
position or touch interaction. As the objects cannot all be controlled by one person, 
this creates a collective ‘remix’ of the content [111]. For further reviews of some 
individual ‘gigs’ in The WaveVR see [6]. 

Volta is an immersive experience creation and broadcasting system [108]. Perfor- 
mances are rendered in VR using artists’ existing tools and workflows, such as 
parameter mapping a DAW to drive visual feedback systems. In addition to the 
VR performance, a mixed reality (MR) experience is also broadcast to streaming 
platforms like YouTube and Twitch. Volta differs from The WaveVR in its production 
method for the artist. In The WaveVR, developing a performance environment can 
take a development team months and incur significant cost. Volta cuts down 
production time by integrating existing tools with spatial experience design templates 
(e.g. particle systems) into a streamlined production process for real-time virtual 
performance environments.⁴ 


6.4.4 Spatial Audio Production Systems 


In the following review of spatial audio production systems in VR, all systems use 
binaural spatial sound presented over headphones (Chaps. 3 and 4 provide an effec- 
tive introduction to such audio technology). It is possible for some of the systems 
(DearVR Spatial Connect, ObjectsVR) to be used with speaker arrays but the design 
implications of this are not considered in this review. 

Addressing spatial audio production, both the Invoke [25] and DearVR Spatial 
Connect [104] systems allow users to record motion in VR to control sound objects. 


4 The first author supported the design of early prototypes of Volta XR; interested readers can review 
the design development at https://thefuturehappened.org/Volta. 


(b) Volta XR’s spatial creation workflow, picturing an audience space in development. 


Fig. 6.8 Collective music experience spaces in VR 


The main functional difference between the systems is that DearVR Spatial Connect 
uses a DAW to host the audio session with the VR system acting as a control layer 
for spatial and FX automation, while Invoke is a self-contained collaborative spatial 
audio mixing system. The systems also differ in their design approach to space and 
sonic interaction. 

Figure 6.9a shows Invoke, a collaborative system that focuses on expressive spatial 
audio production using voice as an input method. The system utilises a mixture of 
direct and indirect spatial interaction to record spatial-sonic relationships. A Voice 
Drawing feature allows for the specification of spatio-temporal sonic behaviour in a 
(a) Invoke spatial audio controller. Image is of a user manipulating audio trajectory points. 


(b) DearVR Spatial Connect system with all interface modules active and five sources positioned 
in space. 


Fig. 6.9 VR spatial audio production systems 


202 T. Deacon and M. Barthet 


continuous multimodal interaction. Voice input is recorded as loudness automation, 
while a drawn trajectory controls the location of the spatialised audio over time. Using 
an automated process, the trajectory is segmented into a Bézier curve with multiple 
control points for further collaborative manipulation. The UI design uses a mixture 
of 3DUI (audio objects, trajectories) and semi-transparent ‘screens-in-space’ (hand 
menus, world-space menus). Spatially, users can navigate the virtual space using 
teleport functionality; all menus travel with the user when they teleport. Invoke is 
the only system in this review to implement a more detailed avatar design: each user 
is represented by a body, head and arms, with additional sensors on each user 
providing accurate body-to-avatar positioning. This enabled detailed forms of social 
interaction and spatial awareness [25]. 
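
As an illustration only, the two data streams behind a Voice Drawing style recording can be sketched in Python. This is not Invoke's implementation: the function names, the use of a frame-wise RMS value as the loudness measure, and the restriction to a single cubic Bézier segment are all assumptions, and the automated fitting of a drawn path to curve segments is not shown.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def rms_envelope(samples: Sequence[float], frame: int) -> List[float]:
    """Loudness automation (assumed measure): one RMS value per
    non-overlapping frame of the recorded voice signal."""
    envelope = []
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = samples[start:start + frame]
        envelope.append(math.sqrt(sum(s * s for s in chunk) / frame))
    return envelope


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Position on one cubic Bezier segment for t in [0, 1]."""
    u = 1.0 - t
    w = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)  # Bernstein weights
    return tuple(
        w[0] * a + w[1] * b + w[2] * c + w[3] * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def automation(control: Sequence[Point], envelope: Sequence[float]):
    """Pair each loudness frame with a position sampled along the curve
    (hypothetical helper: time is mapped linearly onto the curve parameter)."""
    n = len(envelope)
    frames = []
    for i, gain in enumerate(envelope):
        t = i / (n - 1) if n > 1 else 0.0
        frames.append((cubic_bezier(*control, t), gain))
    return frames
```

Moving any of the four control points reshapes the whole path while the loudness frames stay untouched, which is one reason a control-point representation suits later collaborative editing.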

Figure 6.9b shows DearVR Spatial Connect, a professional spatial audio production 
application. The system uses an indirect interaction method to control objects in 
space; a laser pointer controls position while the VR controller thumb-stick controls 
distance from the centre. The design of the surrounding space adds no features beyond 
the interface panels and 3DUI (e.g. sound sources), as users commonly project a 360 
video into the production space. Also, the user is ‘pinned’ to the centre of the space, 
again in line with the rendering perspective of spatial audio for 360 video. One issue 
of the central design is a lack of perspective on multiple objects that may be distant 
from the centre. Also, fatigue and motion noise (distant objects ‘wobble’ more 
spatially) impact control of objects at a distance (dependent on input device design 
and user-based ergonomic factors like strength and motor control) [5]. By contrast, 
Invoke does not constrain users to the central listening position when mixing audio 
objects; users can freely teleport around to gain different sonic and 
visual/interaction perspectives. This is important as the spatio-temporal mixing of 
sound creates a complex field of trajectories and sound objects [25].⁵ 
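
The way motion noise grows with distance under pointer-based control can be made concrete with a small worked sketch. It assumes a pointer direction given as azimuth and elevation, a y-up axis convention with z pointing forward, and a fixed angular hand jitter; none of these details are taken from DearVR Spatial Connect.

```python
import math
from typing import Tuple


def pointer_to_position(azimuth_deg: float, elevation_deg: float,
                        distance: float) -> Tuple[float, float, float]:
    """Convert a pointer direction plus a thumb-stick distance into a
    Cartesian source position (assumed convention: y up, z forward)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.sin(az)
    y = distance * math.sin(el)
    z = distance * math.cos(el) * math.cos(az)
    return (x, y, z)


def wobble(distance: float, jitter_deg: float) -> float:
    """Chord length swept by a source at `distance` when the pointer
    direction shakes by `jitter_deg` degrees."""
    return 2.0 * distance * math.sin(math.radians(jitter_deg) / 2.0)
```

Because the chord length is proportional to distance, the same one degree of hand tremor displaces a source placed ten units from the centre ten times as far as a source placed one unit away.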

ObjectsVR is a system for expressive interaction with spatial sound objects. The 
system provides spatio-temporal interaction with electronic music using 3D geometric 
shapes and a series of novel interaction mappings; examples can be seen in 
Fig. 6.10. User hand control is provided via a Leap Motion, and the experience is 
rendered using a HMD. As a spatial audio control system, object positions were a 
mixture of direct manipulation and ‘magical’ physics-based interaction. Users could 
pick up and throw sounds around the space, but an orbiting mechanic meant that 
sound objects would always move back within grabbing distance. A novel spatial 
feature of this environment was the use of contextual UI when users grabbed certain 
objects. When a user grabbed objects that had 3D mappings, a 3D grid of points 
would appear to provide relative positioning guidance. When the object was released, the grid 
faded away. System design and evaluation investigated users’ natural exploration and probed 
the formation of understanding needed to interact creatively in VR; full details of the 
evaluation can be found in [27].⁶ 
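
The idea of a thrown sound object always returning to within grabbing distance can be sketched as a per-frame update. The sketch below swaps the orbiting mechanic described above for a simpler spring-like pull towards the user, so it illustrates the constraint rather than the ObjectsVR behaviour itself; the function name and the `reach`, `pull`, and `dt` parameters are hypothetical.

```python
import math
from typing import Sequence, Tuple


def step_toward_reach(pos: Sequence[float], user: Sequence[float],
                      reach: float, pull: float, dt: float) -> Tuple[float, ...]:
    """Advance one frame: if the object is beyond `reach` of the user,
    move it back along the line to the user by a fraction of the excess."""
    offset = [p - u for p, u in zip(pos, user)]
    dist = math.sqrt(sum(o * o for o in offset))
    if dist <= reach:
        return tuple(pos)  # already grabbable, leave it alone
    excess = dist - reach
    # Never overshoot the reach boundary, even for large pull * dt.
    move = min(excess, pull * excess * dt)
    scale = (dist - move) / dist
    return tuple(u + o * scale for u, o in zip(user, offset))
```

Calling this once per rendered frame makes the excess distance shrink geometrically, so an object thrown far away drifts back quickly at first and then settles gently at the edge of reach.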


5 The first author participated in formal beta testing of the DearVR Spatial Connect product. 


6 ObjectsVR was a single-user system designed and tested by the first author during a research 
internship at a VR experience design firm. 
(a) User experimenting with Audio-Visual Feedback used to indicate an error of inserting the wrong type of item into a holder 

(b) User utilising magnetic grabbing technique to grab object at a distance. Object controls drum track sound effects 

(c) User exploring a bass cube transferring it between both hands 


Fig. 6.10 ObjectsVR interface user interaction examples 


6.5 Discussion and Implications 


6.5.1 Spatial Design Considerations 


Consolidating the reviews of products and research, a series of design parameters 
emerge. 


Complexity of spatial representation 

Based on the analysis of Sandbox systems (Mux and Sound Stage), it is suggested 
that an unrestricted patching metaphor may be too visually complex for applications 
like collaborative audio production in VR. Also, systems that build the timing of 
compositions in space, LyraVR & Drops, suffer from spatial-visual complexity issues. 
Similar to visual programming languages [36], when all points of state-change are 
presented in one space (a low level of abstraction), the information becomes diffuse, 
and errors may become more frequent. Also, when space is used for functional 
relationships, like musical time, visual design cannot bracket the visual complexity 
without the design of abstractions. Related to these issues, the impact of these design 
features is unknown for collaborative systems. Future research could design systems 
to observe spatial organisation patterns undertaken by users to make sense of, and 
work with, arrangements. 


Screens-in-space and workspace zones 

For certain information (selection menus, settings, note sequences), systems use 
either conventional 2D information presentation in a floating screen (Music Room, 
Block Rocking Beats, EXA, DearVR Spatial Connect), or attempt to redesign infor- 
mation using forms of 3DUI (Lyra, Mux). Also, as described in the Music Room 
analysis, space can be delineated into different action or information presentation 
spaces. The decision to locate functionality in screens or more novel 3DUI is an 
important one for collaborative systems, as each different method offers different 
access points and levels of shared visual information for collaboration. For instance, 
in LeMo, each SUI could be minimised into a bubble for easy arrangement and organ- 
isation. A temptation of VR design could be to embody all interaction in ‘physical’ 
3DUI, such as novel interaction widgets or spatially multiplexed 3DUI (see Fig. 6.2). 
But this could result in added spatio-visual complexity, as in Sandbox systems. To 
deal with this, there would be a need for contextual interaction layers (e.g. when I put 
a cube here it’s different from when I put it there), or function navigation using button 
combinations on controllers (VR 3D modelling software does this [107]). Another 
impact of using entirely 3DUI is that it could limit the amount of shared visual infor- 
mation, as arrangements of ‘physical’ objects naturally obscure each other. However, 
3DUI may provide more access points to embodied collaboration. 


Level of acoustic spatial freedom 

Related to spatial audio, the ability to move from the centre position is a key design 
decision that needs to be made, especially for collaborative audio production software. 
For single-user apps, being able to manipulate arrangements away from the 
sweet spot is of value. For collaborative apps, multiple users located at the sweet 
spot would severely impact normal social interaction. 


Workspace organisation 

For workspace organisation, it should be considered whether fixed or movable UI 
is preferred for certain audio production tasks. For instance, LeMo, EXA, and Invoke 
each utilised methods for users to reorganise the SUI, while artefacts like Block 
Rocking Beats and Polyadic did not. 


Control, Play and Expression 

Designers should consider how playful they make spatial audio experiences, or 
whether specific control and sound automation is the design target. For instance, 
in the ObjectsVR system spatial audio objects had ‘magical’ interaction; in contrast, 
DearVR Spatial Connect emulates DAW automation. What is missing here 
are more examples of user experience in mixed systems, and environments to playfully 
explore spatial sound interactions with levels of direct control and serendipity. 
Related to making the experience of control more expressive, integrating different 
modalities provides opportunities to expand on the DAW control paradigm, such as 
in Invoke. 


Egocentric spatial design 

Related to the previous two features, some systems (e.g. Mux, Music Room) tend 
towards egocentric spatial patterns, with devices and elements situated around the 
user, oriented to one spatial viewpoint. While making sense for an individual applica- 
tion, these forms of design decisions need to be carefully considered in collaborative 
systems. 


Avatar Design 
An issue of importance to collaborative systems is avatar design and the spatial 
behaviours that they enable. For instance, inside LeMo, the use of the Leap Motion 
compared to standard VR controllers enabled more detailed forms of hand gesturing. 
Within HCI, work has already begun to evaluate avatars based on the constraints of 
commercial VR [53]. What this area should focus on is moving beyond the so-called 
Minimalist Immersion in VR using only simplistic avatar design. Within Invoke, the 
avatar design utilised a more detailed body representation, offering beneficial charac- 
teristics for social space awareness, as users can interpret gaze and body orientations 
along with hand gestures. This highlights an important area of further research for 
collaborative and collective systems, where there should be detailed evaluations of 
the impact of avatar designs on music production activities.⁷ 


6.5.2 Role of Space and Interaction 


Comparing the separation of the Role of Space with previous research on the space 
of interaction [75], similarities emerge. River and MacTavish analyse space, time 
and information concepts within HCI across a series of paradigms [75]: 


Media Spaces [86]—media types 

Windows, icons, menus, pointer (WIMP) [47]—user space management 
Tangible user interface (TUI) [44]—space-body-thing interaction 
Reality-based interaction (RBI) [49]—emerging embodied interaction styles 
Information spaces [10]—interaction trajectories and navigation of information 
Proxemic interactions [37]—social spatial relationships 


The key spatial dimensions that emerge are: 


Dimension 1 Media and Space Management <> Meaning through interaction 
Dimension 2 Personal and physical <> Social and behavioural 


Dimension 1 describes the difference between conventional GUI design (e.g. 
WIMP) versus approaches using space and the embodiment of technology (e.g. RBI). 
Dimension 1 relates to the previous analysis on the Role of Space (Sect. 6.3.2.2): 


• Space as a holder of elements for musical input/sonic control 
• Space as a medium of sonic experience 
• Space as a visual resource to enhance musical performance 


Dimension 2 highlights how space influences personal and social interactions. This 
is because information is distributed across technologies and is also embedded into 
contextual spaces, from immediate personal space through to social groups and larger 
collective social interaction spaces. Looking at these ideas together, a framework of 
research emerges for VR IAS spatial design. The functional uses of space in VR IAS 
relate to traditional understanding in the design of media types, user space management, 
and TUI. Space as a medium of sonic experience, meanwhile, can benefit from 


7 Preprint available at https://hal.archives-ouvertes.fr/hal-03099274. 
Fig. 6.11 Spatial experience design in VR IAS Venn diagram 


research in the areas of RBI, and information spaces. Finally, proxemic interaction 
can inform things like social spaces for musical enhancement. But this doesn’t go far 
enough. What needs to be included in space for interactive audio is an understanding 
of architectural space. This is because VR designers must make important decisions 
regarding space as an element of user experience. Regarding social aspects, as high- 
lighted earlier in Fig. 6.4 [84], we can design space for functions, activities and for 
their spatial quality. We must design spaces for intimate individual action, shareable 
group interaction, and visibility and safety in large collective action spaces. Acousti- 
cally the sorts of choices we make here matter too. For example, using simple voice 
chat algorithms could make voice intelligibility poor and yield something similar 
to ‘zoom fatigue’ [7]. Instead, we can utilise spatially aware audio communications 
to deliver intelligible audio for each user in an area of space [60]; a commercial 
approach to this already exists that can handle hundreds of listener-sources across a 
space [115]. 
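
One generic way to make voice chat spatially aware is to weight each talker by their distance from the listener. The sketch below uses the inverse distance clamped attenuation model that is common in game audio engines; it is an illustrative assumption and does not describe the approaches of [60] or [115]. The function names and the `ref_dist` and `rolloff` parameters are hypothetical.

```python
import math
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def distance_gain(listener: Vec3, talker: Vec3,
                  ref_dist: float = 1.0, rolloff: float = 1.0) -> float:
    """Inverse distance clamped gain: full level inside `ref_dist`,
    then falling off as the talker moves away."""
    d = max(math.dist(listener, talker), ref_dist)
    return ref_dist / (ref_dist + rolloff * (d - ref_dist))


def voice_mix(listener_id: str, positions: Dict[str, Vec3],
              ref_dist: float = 1.0, rolloff: float = 1.0) -> Dict[str, float]:
    """Per-talker gains for one listener; the listener's own voice is excluded."""
    here = positions[listener_id]
    return {
        uid: distance_gain(here, pos, ref_dist, rolloff)
        for uid, pos in positions.items()
        if uid != listener_id
    }
```

With the default parameters, a talker two units away is heard at half level and one four units away at a quarter, so nearby conversation partners stay intelligible while distant groups recede.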

We suggest that spaces need elevated priority in our VR design and evaluation 
practices. To support this process, we suggest three top-level spatial categories that 
need to be addressed through interdisciplinary design work: spaces/places, inter- 
faces, interactions. Visualised in Fig. 6.11, some of the elements discussed in this 
chapter are positioned within the different design spaces; for instance, VR selection 
and manipulation techniques sit between interfaces and interactions. For brevity, 
only the category of spaces/places is discussed in detail below, as previous research 
within interfaces and interactions is already well documented in this chapter and other 
research [6, 12, 77]. The categories scaffold future design by drawing together topics, 
theories, and previous art. Addressing elements that overlap with spaces/places 
in Fig. 6.11, we can use the Venn structure to ask new questions about the interaction 
of spaces in feature design. For instance, context-aware on-body UI refers to the idea 
that if we have more specific spaces for interaction, we can also tune the UI 
to be relevant to that moment in space and time. The notion of putting it on our body, 
like a virtual smart watch, means that this design element is part of all three categories: interfaces, 
interactions, and spaces/places. Implicit in such simple categories is the elevation 
of spaces to a design concern of equal standing with more thoroughly investigated work like spatial 
interfaces and spatial interaction. Fully describing such a framework is beyond 
the scope of this chapter; instead, it is offered as a proposition for the research 
field to further explore together. 


6.5.2.1 Spaces/Places 


Spaces are the architectural layouts and areas that form features of a virtual envi- 
ronment used for sound and music activities in VR. An example of a space can be 
seen in Fig. 6.12. In that figure a central production area is enclosed in a grid/cage 
structure, bounding it off from the wider spatial setting of floating ‘sand-dunes’ and 
night sky. But what does it mean to design for experience within space, and how 
does this relate to an IAS? Borrowing from human geography and architecture [22, 
87], some spatial concepts to consider are: 


Boundaries; 

Form and space; 

Organisations and arrangements; 

Circulation (i.e. movement through space); 

Proportion and scale; 

Principles and metaphors (e.g. Symmetry, Hierarchy, Rhythm). 




Places are spaces with fixed or emergent social meaning [32]. We can aim to 
design the spatial qualities of spaces; for instance, the typology of [84] in Fig. 6.4 
gives designers ways to conceptualise creative spaces. We can ask, what is the space 
type (e.g. personal or collaborative), and what is the intended spatial quality (e.g. 
knowledge processor or process enabler)? Then we can ask, within those bound- 
aries, what are other spatial characteristics i.e. comfort, sound, sight, spaciousness, 
movement, aliveness/animus? 

As architecture, human geography, and interior design are such deep disciplines, 
interdisciplinary work needs to be done here to produce a dialogue around the design 
of space for sonic and musical expression. One area of mutual influence to consider is 
the design of immersive installations that involve technology to alter user experience. 
VR can learn from techniques and theories in this area [3], as well as be used to 
prototype systems for physical installation. 
Fig. 6.12 Example of a VR IAS space: the Invoke artefact’s spatial audio composition area 


6.6 Research Directions and Opportunities 


6.6.1 Embodied Motion Design 


Echoing the design principles within Atherton and Wang’s work [6], motion, embod- 
iment and play are important design spaces to explore. However, human motion and 
spatial analysis is not a new discipline for computing and technology, with special 
research groups such as the International Conference on Movement and Computing 
(MOCO) and the ACM SIGGRAPH Conference on Motion, Interaction and Games 
(MIG). Within these existing dialogues, the role of embodiment is a central topic 
of design [83] (see Chap. 7 for further details). What would differ in virtual spaces 
is a form of synthesis, or symbiosis, between visual and proprioceptive embodi- 
ments. The plural is intentional, as virtual environments may introduce the idea that 
embodiment is not a fixed state, with avatars and motion feedback being augmented 
by the virtual setting. A research problem in this area is determining appropriate 
vocabularies for low-level and high-level motion so that systems of motion analysis 
and mapping can be utilised in an informed way. But the difficulty in VR IAS is that 
systems will often need to utilise data from only the headset and controllers, whereas 
many previous approaches have been developed using high-resolution motion capture 
data [29]. Also, motion design is not just a single-person experience. Take for 
instance dancing in a crowd. Research into virtual togetherness through joint embod- 
ied action is a rich direction for collaborative and collective systems to explore [40]. 
6.6.2 Designing for Collaborative Sound and Music in Virtual Reality 


There is a paucity of design and evaluation frameworks addressing social experiences 
in sound and music VR. While work is ongoing in this area (for instance, Men 
and Bryan-Kinns’ chapter in this volume, Chap. 8), to address the gap in design 
knowledge for VR, design perspectives from other embodied CMM and HCI research 
provide valid considerations for the design of SVR. The following integration of 
research from other fields intends to offer SMC actionable research directions to 
support collaboration in VR. 


Adapting Tangible User Interface Research 
An area of potential influence on spatial design for social VR is to look at how 
TUIs are designed to support spatial collaboration. For example, [65]’s research on 
CMM in VR shows similar results to co-located CMM using TUIs [96], regarding the 
design of public and private workspaces. When designing TUIs for co-located CMM, 
spatial orientation and configuration are important design areas. The Hitmachine is 
a tangible music-making tool for children, focused on creating and understanding 
collective interaction experiences [38]. To understand interactions with devices like 
the Hitmachine, there is a need to design social interactions and technology together. 
For designers, this means specifying and evaluating how people distribute attention, 
share attention, dialogue, and engage in collective action. To analyse designs in 
context, spatial formations of peoples’ positions and orientations can be analysed 
to understand different constructions of social play in CMM [38]. Observations of 
social engagement around Hitmachine found that the configuration of space (people, 
furniture, and music interfaces) altered the level of social interaction. Also, regarding 
the design of space in VR, research findings from VR CMM resemble the results 
from the Hitmachine analysis [64]: how spatial encounters are set up for music 
interaction impacts social interaction. So, to design collective interaction spaces, how 
basic spatial partitions are implemented matters. 

Another TUI design principle of relevance is to provide multiple access points to 
a collaborative task [45, 76]. This means devising multiple spatial ways for different 
users to act on the same object, creating a form of DUI. Research suggests that the 
more access points participants have to a collaborative task, the more equitable 
participation is [76]. Increasing the tangibility is also said to improve participation. 
This is because users can complement what each other is doing in spatial tasks, 
using space as an organiser of the shared activity [76]. Adapting tangibility to VR 
means designing the affordances of objects appropriately to allow collective spatial 
interaction, while keeping in mind that we can move beyond some of the constraints 
embedded in physical reality. A good example of this is in VR Sandboxes. In physical 
reality, physics governs layout patterns of blocks whereas in VR elements can be 
placed in any part of 3D space. This in turn impacts the design of modules and how users 
connect them [6]. But as mentioned previously, idiosyncratic design patterns within 
Sandboxes may need additional support for collaboration, and this is where previous 
TUI work could be integrated [97]. 
Collectively, these similarities suggest that as a form of spatial collaboration, VR 
CMM can benefit from other non-VR research findings regarding spatial interaction 
to design systems. But, directly importing collaborative design concepts from other 
media should be done carefully, and thoroughly evaluated for any differences in 
results across media (see [25] for a media comparison study focusing on this). 


Designing for Embodiment in Collaboration 

Embodied spatial input and avatar representation are key features of VR for support- 
ing intimacy [54], awareness and coordination [41], and control [1]. Spatial media, 
such as VR, has the capacity for visual and spatial abstraction of UI, something 
needed for the complex requirements of expert music production [28]. The follow- 
ing examples highlight some specific opportunities to support spatial collaboration. 


Augmented Object Interaction The affordances of embodied interaction in SUI 
offer possibilities to transform how joint action on complex digital objects can 
occur [1, 2, 8, 21, 55]. 

Awareness Support Embodied control and spatial representation in VR can ame- 
liorate mutual understanding issues in shared workspaces compared to other 
media [79]; support informal awareness to co-ordinate actions given shared visual 
information [30]; provide pointer mechanisms that support referencing of con- 
tent and environmental objects [23, 94, 95]; allow for the recording of embodied 
motion, as a form of embodied memory within an environment [58, 63]; provide 
novel mechanisms for the division of labour and workspace organisation [64]. 

Spatial Problems Space is a powerful organiser of human memory and can change 
how we solve problems [18, 50], and VR, compared to WIMP systems, is sug- 
gested to alter problem-solving strategy in spatial tasks [50]. 


These considerations have in common an influence on the interaction space in col- 
laboration. This suggests that the collaborative process in sound and music production 
could be improved by designing support for augmented interaction and awareness. 
For example, in a common studio environment, usually, a shared screen (or set of 
screens), a keyboard and mouse, mixing desk with dedicated audio outboards, are the 
tools in the hands of audio producers. In contrast, in an embodied VR interface, the 
possible interaction space can centre around collaborative spaces where functionality 
is engineered to support mutual access and modification, adapting levels of visibility 
and position based on collaborative needs. 


6.6.3 Spatial Audio Production for Immersive Entertainment 


VR provides an ostensibly promising environment for spatial audio production; it 
is an example of a professional workflow that could benefit from further research 
into interaction methods in VR. The spatial nature of the technology, and action 
in it, could help address problems encountered when making audio compositions 
in space (e.g. transformation of spatial reference frames between self and audience) 
[25, 26]. Regarding the previous analysis, a highly significant research area would be 
the management of complexity in the information design of spatial representation. 
The impact of these improvements would be felt within fields such as immersive 
entertainment, where spatial audio technologies allow the engineering of sound- 
scapes that represent real or imagined sonic worlds, using the location of sounds 
in space as a critical component of audience experience. In particular, there is an 
under-explored research opportunity in VR to enable more collaborative practice 
for spatial audio production. This addresses a need in professional audio production 
communities that look to make content for immersive entertainment.⁸ 


6.7 Conclusion 


Much of how we design VR is based on borrowed design principles. We import ideas 
from other disciplines and hope they ‘fit’. But to capitalise on any opportunities for 
enhanced expression and new forms of sonic entertainment presented by VR, we 
must set out how we design, what that involves, and what that excludes. Given such 
a broad focus embedded in the concept of space, the first goal of any schematic 
representation of design types and guidelines is to find suitable descriptors to collect 
the features relevant to domains of research. For researchers, this means setting out 
the design rationale behind systems clearly, so that over time we can understand the 
emerging practice and propose novel directions. This research offered the beginning 
of this process for the design of IAS for VR, setting out the different functional types 
both research and commercial interests pursue while reflecting on the way space is 
implicated in their design. This provides a framework for spatial design, highlighting 
a set of actionable areas for future design research. From our perspective, a key 
missing piece is guidance about how to design spatial social experiences in VR 
for engagement with sound and music. We need to define the transitions between 
individual, collaborative and collective interaction when it comes to audio interaction. 
A stepping stone across this gap is more research into avatar design for SIVE: to 
start assessing spatial transitions in social activity, we need to understand virtual 
embodiment as the vessel that affords basic social communication beyond speech. 
Looking forward, we should begin to think about what it means to be an immersive 
application designer that is audio-first. Realising that practice will need to integrate 
concepts from acoustics, architecture, phenomenology, HCI and SMC, this calls us 
to think about transdisciplinary pedagogical models to support development in the 
field. 


8 Narrative and physical experiences that engage an audience member in a fictional world, for 
instance immersive VR theatre production. 
Abstract As the next generation of active video games (AVG) and virtual reality 
(VR) systems enter people’s lives, designers may wrongly aim for an experience 
decoupled from bodies. However, both AVG and VR clearly afford opportunities 
to bring experiences, technologies, and users’ physical and experiential bodies 
together, and to study and teach these open-ended relationships of enaction and 
meaning-making in the framework of embodied interaction. Without such a frame- 
work, an aesthetic pleasure, lasting satisfaction, and enjoyment would be impossible 
to achieve in designing sonic interactions in virtual environments (SIVE). In this 
chapter, we introduce this framework and focus on design exemplars that come 
from a soma design ideation workshop and balance rehabilitation. Within the field 
of physiotherapy, developing new conceptual interventions, with a more patient- 
centered approach, is still scarce but has huge potential for overcoming some of the 
challenges facing health care. We indicate how the tactics such as making space, 
subtle guidance, defamiliarization, and intimate correspondence have informed the 
exemplars, both in the workshop and also in our ongoing physiotherapy case. Impli- 
cations for these tactics and design strategies for our design, as well as for general 
practitioners of SIVE are outlined. 
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7.1 Introduction 


I felt that there was an opportunity to create a new design discipline, dedicated to creating 
imaginative and attractive solutions in a virtual world, where one could design behaviors, 
animations, and sounds as well as shapes. This would be the equivalent of industrial design 
but in software rather than three-dimensional objects. Like industrial design, the discipline 
would start from the needs and desires of the people who use a product or service, and 
strive to create designs that would give aesthetic pleasure as well as lasting satisfaction and 
enjoyment [17]. 


Thus spoke the IDEO co-founder Bill Moggridge in his book Designing Interactions
(2007), on inventing the term “interaction design”. The field of Sonic Interaction
Design was initially concerned with aesthetic pleasure, lasting satisfaction, and
enjoyment [24], but more recently the research focus in sonic interaction in virtual
environments (SIVE) has shifted towards sound spatialization tools and
techniques. We posit that uniting sound and movement can bring back the desired
qualities of sonic interaction to SIVE.

When we reviewed the interaction styles and metaphors in past SIVE papers
[24], we noticed how movement was mentioned as an integral part of sonic interaction,
and we identified three broad categories of sonic interaction in those papers:
(1) object-focused, (2) direct mapping, and (3) movement-focused [10]. Twenty-six
papers in the SIVE corpus mentioned the term ‘movement’ (119 times in total). Yet,
no paper in the corpus gave a processual account of how these sound-movement
interactions are actually designed. In other words, the coupling between movement
and sound is treated as a black box in SIVE papers, and design dimensions such
as aesthetic pleasure, lasting satisfaction, and enjoyment are not considered.

This is why we propose the general approach and particular elements of soma 
design for designing interaction in virtual environments. Soma design is a design 
process where designers aim for an improved sensory appreciation through their 
lived, sentient, subjective, purposive bodies—both improving their own design skills 
and sensitivities, but also aiming to deliver designs to end-users [12, 27]. Soma 
design aims to provide aesthetic pleasure, lasting satisfaction, and enjoyment to a
wide range of users, also in virtual environments. This aim extends to the hardest
living conditions, including but not limited to aging, frailty, and physical pain.

This chapter focuses on encounters between soma design and movement-focused 
sonic interaction. By providing selected soma design concepts, design exemplars, 
and tactics, we hope to better articulate the need for movement-sound-interaction 
relations. To do this, we focus on the subtlest manifestation of these relations: the act 
of balance. We start with five soma design concepts we find most related to balance, 
and review three soma design exemplars using these concepts. We then put our 
considerations into an ongoing physiotherapy case study, which is being conducted 
in collaboration with an outpatient rehabilitation center in Frederiksberg, Denmark. 
We finally outline the implications of soma design for our next design phase, as well
as for sonic interaction design practitioners in general. The structure of this chapter
follows this narrative.


7.2 Soma Design


Philosophically, soma design is based on Shusterman’s project somaesthetics, which 
is defined as the “critical, meliorative study of the experience and use of one’s 
body as a locus of sensory-aesthetic appreciation (aesthesis) and creative
self-fashioning” [25]. Somaesthetics has been adapted as a theoretical foundation for
explaining the aesthetic experience of interaction early on, but Höök has also
translated the practical aspects of somaesthetics into the design disciplines [12, 27].

In 2017, the last author of the current chapter organized a soma design workshop
with the leading proponents of the approach, design professionals, and about a dozen
researchers. Our focus was the movement, sound, and light design on an actual bridge
connecting two buildings on our campus.¹ We learned how to pay close attention
to our bodies and first-person experiences while walking forwards and backwards,
dancing, and crawling on the bridge (see Fig. 7.1), as well as during collective
movement and reflective sessions. We also noted how this pragmatic approach differs
from more cognitively rooted approaches in sound design [18] by putting movement
at the forefront and keeping attention on the entire body or its parts at all times. In
the following, we extend these reflections towards experiential virtual environments by
visualizing the concepts in Fig. 7.2, as a seed for future multisensory world-making
sessions in extended realities.

The Inscription Bridge considers how people use different parts of their bodies 
dominantly while leaving traces on the bridge. The traces will be initiated by light 
and spatialized ambient sound, but will be “carved” on the bridge by its curvature, 
and body parts. The curvature is, 


felt with your balance, how it changes your walking up or down. Which part of the body 
(people) use will change the experience, in a different way every time. 


Smoothed carving and particle rolling sounds could complete the act of inscription. 

The rationale of the second concept, Bridge to Heaven, was to make people more
aware of their surroundings outside of the bridge. To build such an awareness,
the designers decided to create a danger zone across the width of the bridge, and envisioned
removing the side walls in a virtual environment. They wanted people to feel the
danger and tension while passing over an open bridge without any fence. In the
designers’ words:


“An everyday zone and then enter the danger zone as we call it Heaven.” “Totally open 
bridge no fence nothing .... In order to be safe, you have to be aware of the surrounding!” 
“Tension and relief and tension. . .” 


This design concept replaces an interior soundscape with an exterior one, and sonifies 
the danger zone with buzzing, supernatural, electric-like warning sounds. At the 
Heaven side, there will be a localized, granulated, and evolving major-seventh chord 
played by strings and a harp. 


1 See http://soma-rhythms-2017.weebly.com.

Fig. 7.2 Soma workshop outcomes. a Inscription Bridge (left), b Bridge to Heaven (middle), and
c Awe and Wonder Bridge (right)

The third concept, the Awe and Wonder Bridge, concentrated on the ceiling. This
is a design concept that becomes perceptible only if people slow down and explore the
bridge. They will experience a night sky full of stars on the ceiling, and use it as a
canvas to create their own painting. In the designers’ words:


“We decided to put the emphasis on the ceiling because not to disturb people who don’t want
to get involved like people who just walk there, drink coffee...” “If you just walk slowly
and then stop, that might be a start. Because the ceiling is your canvas, you are an artist,
and your friends are artists as well. . .” “You move, you participate as you slow down... You
create your painting!”


This concept clearly took inspiration from Petros Vrellis’ interactive rendering of
Van Gogh’s painting Starry Night² and affords a similar, granular soundscape with
localized, high-pitched star tines.

We hope the workshop process and ideation outcomes provide insight into the 
soma design space, and its relevance for sonic interactions in virtual environments. 
More recently, in a series of investigations, Plant et al. used soma design in tandem 
with critical incident technique for ideation and interactive machine learning for 
computation [22]. Sensory misalignment in virtual environments has informed the 
work of Tennent and colleagues [28]. At the same time, the teaching space of soma 
design has been more widely disseminated [29, 30], and applied to VR [7]. We are 
now able to try out and exchange soma design practices in a wide range of domains [10],
including virtual and augmented realities. Therefore we are in a good position to 
extend the multimodal listening design framework introduced in [26] towards bodily 
interaction through soma design. 

A brief description of some characteristics encompassed by soma design can
be outlined as follows: subtle guidance (directing focus and attention, for example
towards a part of the body, without grabbing attention), making space (slowing
down time, disrupting habitual routines, and providing literal secluded areas), intimate
correspondence (synchronized feedback loops), and articulate experience (providing
opportunities to articulate the felt bodily experience). An important grounding for
these methodologies is the concept of perspectives. The act of defamiliarization
also shapes these characteristics. Defamiliarization, also known as estrangement [31], is
a tactic to unbalance an established relationship between a movement, interaction,
or sound (e.g., acousmatic listening) for generating novel design ideas [14].


7.2.1 Defamiliarization: Making Strange 


A key aspect of the design approach outlined in [14], and elaborated further in [31] 
is the concept of “Making Strange”. It aims to change certain aspects of a familiar 
activity until automated behavior acquired through habitual practice or experience 
(ingrained somatic habits) is broken, and a reflection on the inner processes is initiated 
within our bodies. The phases of defamiliarization may be grouped into four discrete
steps [31]: disrupt, destabilize, emerge, and embody. In our bridge workshop, for example,
we defamiliarized our everyday experience of crossing the bridge by dancing and
crawling on it.


2 http://artof01.com/vrellis/works/starry_night.html.

Postural stability, more popularly referred to as balance, could be another
example for making strange. It is something we all do every day when walking,
running, sitting, and standing. To really understand what is involved in our balancing
habits, we need to disrupt them. But engaging in arbitrary disruption might not
destabilize the core of what we are searching for. Since we usually do not get sonic
feedback from our balancing activities (except maybe from external auditory stimuli
such as a creaking floor, or audible sounds from our joints in acute conditions),
sound may provide the disruption needed. Within physiotherapy, both static and
dynamic balance exercises are often embedded in therapy programs specifically
targeting the elderly, since postural instability generally increases with age [21].
The imbalance may be caused by an inability to integrate somatosensory, vestibular,
or visual information [20]. Ideally, the participants will take on and understand what
a sonification of balance might entail through a first-person intellectual, visceral,
and somatic engagement. For an exemplar on balance and its relation to soma design
and sensory misalignment in virtual reality using vibratory haptics, please see [28].


7.2.2 Perspectives 


Soma design distinguishes between three perspective modes, namely the first-, second-,
and third-person perspectives [12, 27]. The third-person perspective conceptualizes
an observational approach to design, encompassing routine methods in interaction
design such as observing, interviewing, and user testing. The second-person perspective is
important in user-centered or participatory design. Soma design puts forth the case
for designing from a first-person perspective instead.

The first-person approach is represented by the designer actively engaging her 
physical body with the artifact under consideration during every part of the design 
process. In other words, this perspective revolves around being the user and attempting
to experience what they will inevitably experience. Participatory design approaches 
are not neglected in this scenario. Höök argues that in order to make a meaningful 
design artifact, the designer has to take an active part in the participation aspect, not 
merely rely on observations. This creates a stronger coupling between the intended 
design idea (mental map) and how it is perceived by its end-users. 

A related concept was also used in [14], distinguishing between the mover,
observer, and machine perspectives. The mover perspective is very similar to the
first-person perspective. It ensures that designers generate first-hand experiences
of the activity being developed, which remain closely linked to the felt, lived
experience of the potential user. The observer represents the idea of subjective evaluation
through inspection of data, for example, video analysis or motion capture.
The observer perspective is a loop meant to improve the desired movement through
performance and subsequent inspection. The machine perspective follows from the
fact that any application that uses movement as the primary source of interaction
must process and make sense of the inputs. Hence,
this perspective is about mapping movement captured or recorded by some sensing
technology into meaningful representations and/or feedback for the observer and
mover. Machines currently capture movement only with considerable loss in
space, time, or range. Understanding these limitations is crucial in human-computer
interaction.³ Loke and colleagues provide convincing examples of how these three
perspectives can be combined holistically in design [14].


7.3 Soma Design Exemplars for Balance 


7.3.1 Balance Rehabilitation 


The ability to maintain balance is fundamental for an individual’s capability to move
and function independently. Since postural stability declines with age, older
adults are at an increased risk of falling, which can result in severe injuries. Therefore,
balance training is often a well-integrated part of rehabilitation programs to improve
balance and self-efficacy in activities of daily living (ADL) [20]. According to [2],
balance loss usually occurs in situations where attention is diverted; therefore, many
interventions seek to embed physical activities that increase body awareness and
kinesthetic awareness, including but not limited to dance-based training, aerobics,
and tai chi, to improve balance and reduce falls [13]. However, the training has
to be repeated procedurally to promote motor learning, causing many patients to
lose interest and motivation [4]. Both AVG and VR systems have been deployed
to increase enjoyment and exercise adherence. Most often such systems rely on
visual, audio-visual, and/or vibrotactile feedback. However, balance deficiencies are
compensated by visual or podal dependence, and (static) balance rehabilitation
often includes exercises performed both with visual cues (open eyes) and without (closed
eyes) [15]. In fact, previous research suggests that balance therapy using visual
deprivation is more effective than therapy that also uses vision, which indicates that
vision can become a compensatory coping strategy for balance deficiencies [3].
Yet, augmented systems rarely rely solely on auditory feedback, meaning that such
systems likely keep the user from training other sensory-motor modalities that
are critical to postural stability [5]. For this reason, it is necessary to explore
how SIVE, focused on auditory feedback only during closed-eyes balance tasks, can
be used to support balance training.


3 Readers interested in the machine perspective are referred to the MOCO provocation at
https://provocations.online/whatescapescomputation/.


226 S. B. Olsen et al. 


7.3.2 SWAY 


SWAY is a prototype that seeks to encourage exploration of postural stance and
stability through somaesthetic experiences [1]. On a high level, SWAY conceptualizes
a dedicated space. Users within this space are tracked (observed) by a Kinect depth
camera, which serves as the only means of capturing interactions with the system.
From the pose (skeleton) acquired through the Kinect software, the authors extract
an estimate of the center of mass (COM) relative to a fixed origin. Fluctuations of the
COM along the X and Z axes are used to control two feedback mechanisms. The first element
is a mechanical plate resembling a square bowl, which contains a set of marbles. This
element is placed in front of the user within the SWAY space, and it delivers
both visual and aural feedback. Micro-movements (fluctuations in the COM) tilt the
plate, which in turn makes the marbles move. In the words of the authors, “...audio
feedback from the marbles on the wooden platform, creates a soothing soundscape
that could be compared to the sound of rolling waves” [1, p. 471]. The second
element is a wooden platform placed at the user’s feet. Two loudspeakers underneath
the plate propagate vibrations through the material, serving as haptic feedback. The
amplitude of the vibration signal is panned across the two speakers depending on
the current offset of the COM. Through the combined modalities experienced
via these elements, SWAY embraces many of the somaesthetic appreciation
design concepts [11]. Its innate physicality relates it to making space. The quality
of subtle guidance towards posture is achieved through the soundscape arising from
the rolling marbles and the haptic vibrations. SWAY especially seeks to embrace
the quality of intimate correspondence, with the feedback serving as an amplifying
mirror of the bodily micro-movements.
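
The chapter does not detail how SWAY estimates the COM or how it pans the vibration signal, so the following Python sketch is only one plausible reading of the description above. It assumes a weighted average of joint positions for the COM estimate and an equal-power pan law for the two loudspeakers; the joint names, the weights, and the `max_offset` range are hypothetical and are not taken from [1].

```python
import math
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def estimate_com(joints: Dict[str, Vec3], weights: Dict[str, float]) -> Vec3:
    """Weighted mean of joint positions (assumed COM estimate, not SWAY's own)."""
    total = sum(weights[name] for name in joints)
    return tuple(
        sum(weights[name] * pos[axis] for name, pos in joints.items()) / total
        for axis in range(3)
    )


def pan_gains(offset_x: float, max_offset: float = 0.1) -> Tuple[float, float]:
    """Equal-power pan: map a lateral COM offset to (left, right) amplitudes.

    max_offset is a hypothetical sway range in metres; larger offsets are clipped.
    """
    p = max(-1.0, min(1.0, offset_x / max_offset))  # normalise to [-1, 1]
    angle = (p + 1.0) * math.pi / 4.0               # 0 .. pi/2
    return math.cos(angle), math.sin(angle)


# Hypothetical skeleton with illustrative mass fractions.
skeleton = {"head": (0.02, 1.6, 0.0), "torso": (0.03, 1.1, 0.01), "hips": (0.01, 0.9, 0.0)}
mass = {"head": 0.1, "torso": 0.5, "hips": 0.4}
com = estimate_com(skeleton, mass)
left, right = pan_gains(com[0])
```

An equal-power law keeps the summed vibration energy constant while the balance between the two speakers shifts, which is one way to let small COM offsets be felt without changing the overall intensity.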


7.3.3  Snap-Snap T-Shirt 


Snap-Snap is a wearable garment embedded with a matrix of magnets spread out
at even intervals across the back [16]. Through rich haptic feedback, Snap-Snap
gives information about the posture of the back. Intended for people suffering from
repetitive strain injury, Snap-Snap seeks to create acute awareness of posture through
a playful and somaesthetic experience. The design process of Snap-Snap is an exemplar
of utilizing the different perspectives as laid out by soma design. Working primarily
from a first-person perspective, the designer molds the intentions of the garment
to fit the perceptions of the co-designer. The co-designer, in turn, provides feedback
on their reflections and felt experiences during a three-stage design process. In
addition, it serves as a good example of the mover-observer perspectives: the designer
switches between being the mover and being the observer during trials by
the co-designer, and vice versa. As a result of this design process, Snap-Snap
became an excellent example of using both the subtle guidance and intimate
correspondence qualities of soma design. The strength of the haptic feedback was
gradually corrected over the course of the design process, to draw just enough
attention towards the current posture of the back. The close coupling between muscle
contraction/movement in the back and the haptic vibrations forms a feedback
loop. The final design of Snap-Snap can be linked to the “Making Strange” principle
as well. The final placements of the magnets within the garment require its user to
move in unaccustomed ways to activate the haptic feedback around certain parts of
the back. This, in effect, was observed to cause the wearer to move more.


7.3.4 Slow Floor 


Another prototype closely related to balance and estrangement slows walking down
significantly, as done for example in Butoh dancing, and provides sonic feedback
on the quality of the micro-movements [8, 9]. The authors collected phenomenological
accounts of participants walking in relationship to the feedback provided
by auditory displays. A program of case studies working directly with 13 movers
from dance and somatic practices in “slow walking” evaluations, combined with
pilot design interventions in exhibition contexts, informed the iterative and reflective
cycles in this research. These case studies reveal themes around the first-person felt
qualities, the variant and exploratory nature of movement, and the rhythmic patterning
that all result from the pressure-mediated auditory display. The final case study
derives morphologies and features of micro-movement efforts as variant or invariant
to movement intention, thus exploring the felt, first-person perspective in relation to
high-level pressure data resolution.


7.4 Work in Progress: Balance Rehabilitation 


Given the outline of the design strategy and three design exemplars, we will briefly 
explain how this relates to sonic interaction in virtual environments. A movement- 
based interaction consists of finely nuanced coordination between cognitive effort and 
bodily function, and does not entirely concentrate on the objects in the environment, 
but on the body itself. In that state, sound could be strategically used to maintain 
attention. Therefore, we kept the idea and the sound model of a rolling ball [23], but 
removed its tangible interface. Next, we provide a case study on how we tackle this 
nuanced movement-based sonic interaction in balance rehabilitation. 

We made two visits to a Frederiksberg outpatient rehabilitation center and con- 
ducted semi-structured interviews with the primary contact therapists. These inter- 
views helped us to determine the target group and their needs. Sessions at the rehab 
center in this context consist of a heterogeneous group of people of varying ages and 
diagnoses. Unique sessions for treatment of certain illnesses are available. However, 
the therapists would classify their patient teams as either “bad” or “good”.
The bad teams are patients who are severely physically
indisposed. The good teams are those who are recovering from minor inhibitions.
Regardless of specific illness, age, and severity of physical inhibition, therapists
would reuse certain exercise programs and schemes.

In addition to interviews, ethnographic observations were carried out over
three physical therapy sessions at the outpatient center. These observations served
two purposes: (1) to gather further insight on the potential target group, and (2) to generate
an understanding of everyday sessions to determine which type of technological
intervention would best fit into daily routines. During these observations, informal interviews
were also carried out with both the present therapist and her patients. When
asked whether she could see herself using a technological artifact during
her sessions, the therapist was generally positive. She expressed that such a thing could be
woven into her program, or in some cases replace another exercise. However, she
pointed out that if the technology was too difficult to handle (e.g., too complicated
to understand or too impractical to maneuver) she would be hesitant to use it.

A couple of the patients were asked to reflect on their exercises. One patient
explained that his view of an exercise depended on the challenge it presented.
He explained that whether he enjoyed it or not was a self-reinforcing effect:
if the exercise was too difficult or too exhausting, he would gradually come to
dislike it. A group of patients explained that it was largely dependent on their mood
on the given day, and what they perceived themselves to be able to do physically.


7.4.1 System Architecture 


Based on these observations, a virtual prototype based on SWAY was constructed, see 
Fig. 7.3. The primary software running the prototype is a macOS program developed 
in Unity3D using the C# programming language. The program development has 
been realized through object-oriented programming (OOP) principles and has been 
constructed in a modular fashion. The complete architecture can be divided into five 
distinct areas: 


1. Audio Module: The audio module taps into Unity’s built-in audio pipeline. It 
contains the main classes that handle all audio processing. 

2. OSCulator application: An OSCulator application is responsible for commu- 
nicating with a Wii Balance Board (WBB) through a connected Bluetooth port. 
Sensor data from the board is parsed and further broadcast through open sound 
control (OSC). 

3. Balance Board Module: The WBB module is in charge of receiving the OSC 
messages from OSCulator, and interpreting sensor data from the board. It also 
contains the main classes handling physics and game logic. 

4. Interaction Module: The interaction module is the bridge between the WBB
module and the audio module. It interprets user actions from the WBB module
and supplies excitation signals to the audio module.

5. Python Web App: The Python web app is a simple web API that is in charge of
heavy-duty matrix operations.


Fig. 7.3 Perspectives of the virtual environment (which is invisible to the users) developed in Unity
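
The chapter does not say how the Balance Board Module interprets the four sensor readings. A common approach, assumed here purely for illustration, is to reduce them to a normalized centre of pressure. The Python sketch below shows that reduction; the function name, the argument order, and the `min_total` threshold are hypothetical, and the prototype itself is written in C#.

```python
from typing import Tuple


def center_of_pressure(
    top_left: float,
    top_right: float,
    bottom_left: float,
    bottom_right: float,
    min_total: float = 1.0,
) -> Tuple[float, float]:
    """Return a normalised centre of pressure (x, y), each in [-1, 1].

    x > 0 means weight towards the right edge, y > 0 towards the top edge.
    Loads below min_total (hypothetical threshold, same unit as the sensor
    values) are treated as "nobody on the board" and mapped to the centre.
    """
    total = top_left + top_right + bottom_left + bottom_right
    if total < min_total:
        return (0.0, 0.0)
    x = ((top_right + bottom_right) - (top_left + bottom_left)) / total
    y = ((top_left + top_right) - (bottom_left + bottom_right)) / total
    return (x, y)


# Example: slightly more weight on the right-hand sensors.
lean_x, lean_y = center_of_pressure(18.0, 22.0, 18.0, 22.0)
```

Normalizing by the total load makes the result independent of body weight, so the same lean produces the same control value for different patients.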


Consider the following scenario. A rehabilitation patient steps onto the WBB, puts
on a pair of headphones, and closes his/her eyes. As the patient distributes weight across
the board’s four sensors, the WBB module controls a 3D object in a virtual environment
invisible to the user (see Fig. 7.3). A physics simulation in turn makes the object move, and
its kinematic properties are used to generate excitation signals, which are used by
the audio module to generate feedback to the patient. While this auditory feedback
is generated by a physical model of a rolling ball, and is therefore inherently object-focused,
can we turn the attention back to the body and movement by employing
soma design elements?
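
To make the chain from weight shift to excitation signal concrete, here is a minimal one-dimensional sketch in Python. It is not the prototype's Unity physics: the lean input, the `gain` and `friction` constants, the time step, and the speed-to-excitation mapping are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BallState:
    position: float = 0.0  # metres along one axis of the virtual surface
    velocity: float = 0.0  # metres per second


def step(state: BallState, lean: float, dt: float = 0.02,
         gain: float = 4.0, friction: float = 0.8) -> BallState:
    """Advance the ball by one time step (semi-implicit Euler).

    lean is a normalised weight shift in [-1, 1]; gain and friction are
    hypothetical constants, not values from the prototype.
    """
    accel = gain * lean - friction * state.velocity
    velocity = state.velocity + accel * dt
    position = state.position + velocity * dt
    return BallState(position, velocity)


def excitation_level(state: BallState, max_speed: float = 2.0) -> float:
    """Map ball speed to a 0..1 excitation amplitude for the sound model."""
    return min(abs(state.velocity) / max_speed, 1.0)


# Lean to one side for one second, then stand still for 0.2 s.
ball = BallState()
for _ in range(50):
    ball = step(ball, lean=0.5)
level_while_leaning = excitation_level(ball)
for _ in range(10):
    ball = step(ball, lean=0.0)
level_after_stopping = excitation_level(ball)
```

Because the ball keeps its momentum, the excitation level stays above zero after the lean input has returned to zero, so in this sketch the sound outlasts the movement that caused it.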


7.4.2 Soma Design Elements 


The design of the prototype so far can be brought together by describing how its
different aspects relate to creating somaesthetic experiences. Let us break down
how the different elements of the experience correspond to certain qualities of soma
design:


• Making Space has been approached through several design elements. The prototype
is meant to be experienced with eyes closed. This should, in theory, force the
sensory system to weight the vestibular and somatosensory systems higher [12].
Placing oneself on the WBB and closing one’s eyes transfers
mind and body into a dedicated space, both mentally and physically. The
interspersed moment of standstill slows down time and provides an opportunity
for reflection.

• Intimate Correspondence has been approached through the feedback loops arising
from the mapping strategy. This is connected to the aural feedback, which is
provided by an invisible object controlled by physics. Properties of physics such as
inertia extend the movement of the virtual object when the mover attempts to stand still,
which in turn extends the aural feedback. This evokes a correctional movement in
the mover, which results in a feedback loop until a total standstill is achieved.

• Subtle Guidance is achieved through the design of the aural feedback. The audio is the result
of a feedback chain starting from the mover, moving through the machine, and
the effects of a physics system controlling a virtual object. Hence, there is an
argument for making the audio physically inspired as well. Recall the SWAY
project [1], which created a rich soundscape through marbles rolling on a wooden
platform. Drawing on this inspiration, investigations on the audio design were
aimed towards real-time synthesis of rolling and bouncing objects. This has been
realized with modal synthesis, as is customary in sound source modeling (see
Chap. 2 for guidance on this topic). The other components of SIVE, namely (1)
sound propagation modeling and (2) sound receiver modeling [24], remain to be
implemented in our prototype.
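
For readers unfamiliar with modal synthesis, the following Python sketch shows the basic idea: an excitation signal drives a bank of two-pole resonators, one per mode, each ringing at its own frequency and decay rate. This is a generic textbook formulation and not the prototype's audio module; the mode frequencies, decay rates, gains, and the sample rate below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Mode = Tuple[float, float, float]  # (frequency in Hz, decay rate in 1/s, gain)


def modal_filter(excitation: Sequence[float], modes: Sequence[Mode],
                 sample_rate: float = 44100.0) -> List[float]:
    """Filter an excitation signal through a bank of two-pole resonators."""
    out = [0.0] * len(excitation)
    for freq, decay, gain in modes:
        w = 2.0 * math.pi * freq / sample_rate  # angular frequency per sample
        r = math.exp(-decay / sample_rate)      # per-sample amplitude decay
        a1 = 2.0 * r * math.cos(w)
        a2 = -r * r
        b0 = gain * math.sin(w)                 # scales the impulse response to the mode gain
        y1 = y2 = 0.0
        for n, x in enumerate(excitation):
            y = b0 * x + a1 * y1 + a2 * y2
            out[n] += y
            y2, y1 = y1, y
    return out


# Hypothetical modes for a small, hard object; one impact at sample 0.
example_modes = [(440.0, 50.0, 1.0), (1130.0, 80.0, 0.5), (2470.0, 120.0, 0.25)]
impulse = [1.0] + [0.0] * 799
impact_sound = modal_filter(impulse, example_modes, sample_rate=8000.0)
```

A continuous rolling sound can be approximated by replacing the single impulse with an ongoing excitation whose amplitude follows the speed of the virtual object, which is where the interaction module's excitation signals would enter.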


7.4.3 Initial Observations 


To evaluate the sound-source modeling of the prototype, a small study was conducted.
The participants we designed with were four patients (mean age = 71, SD
= 8): three males and one female. Three of the four patients were recovering from
chemotherapy and one had general balance issues. Three of the patients
had never used technology in a rehabilitation context, while one had used it 4–5
times. To gather further insight on the felt bodily experiences, the first author
encouraged participants to “think aloud” (as per the think-aloud method, e.g. [19]),
or to “articulate experience” (e.g. [11]).


7.4.4 Test Procedure 


The test was conducted on April 14, 2021 at the outpatient rehabilitation center
in Frederiksberg, Denmark, during an actual therapy session. The prototype was
allowed to take the place of an exercise and be incorporated in a routine therapy
session (see Fig. 7.4). Before commencing the test, the participants read, understood,
and signed a consent form. The whole evaluation procedure took approximately one
hour. Each participant was allotted 15 min, of which approximately 10 min were spent
trying the prototype and another 5 min filling in the rating scales. Before trying
the prototype, each participant was informed about the general purpose of the test.
They were asked to put on the headphones and step onto the balance board. The board
was placed behind a chair which the participant could use for support (see Fig. 7.4).
From this point, the application was run, and the participant was told to close
his or her eyes and just explore the space available by distributing weight across the
balance board. During this time, they were encouraged to report on their general
thoughts. After a while, or if the tester recognized that the participant was stuck, they
were allowed to open their eyes and try the application with visual feedback from
the otherwise “invisible” virtual environment. After having tried the prototype, they
were asked to fill out the evaluation surveys.


Fig. 7.4 Test setup at the rehabilitation center


7.4.5 Observations 


The first participant (male, age 80) was hesitant to try the prototype at the outset. 
After he was convinced to try it by the present therapist, he struggled to understand 
the concept. While observing the virtual interface, the author noticed that he was 
unable to get the virtual object moving at all, which in turn resulted in little to no 
feedback. During the whole 10 min, even when allowed visual stimuli, he was unable 
to navigate around. Admittedly, he was frail and had a hard time even standing up 
without frontal support. Hence, he could not create enough force for the pressure sensors
in the WBB to recognize his attempts. 

The second participant (male, age 74) did better. Even though he was similarly
in need of support, he managed to navigate around the virtual environment with his
eyes closed, thereby producing feedback. When visual stimuli were allowed, he was
able to complete several obstacles and managed to score a point.

The third participant (male, age 58) simply did not comprehend the interaction. 
When asked to elaborate, he explained that he could not perceive what the goal was. 
Again, similar to participant one, he was a bit hesitant to give in to the experience,
and declined to have his eyes closed. He could maneuver around fine, but chose to
use the support anyway.

The fourth and final participant (female, age 74) was surprisingly positive. Of the 
four participants, she was the most able and/or agile, but still chose to use the provided 
support. She was able to navigate around using only sound, and even managed to 
explore an obstacle, which unfortunately she could not escape. After allowing her 
visual stimuli, she considerably improved, both in terms of game progression and 
participation factor. Struggling with existing balance problems, she was used to
doing various rehabilitation exercises, and explained that she had a hard time pushing
herself to maintain them. She explained, in contrast, that she could see herself using
the prototype often. However, she expressed that she really did not care about any of
the aural elements and that they did not affect her in any way. Even so, just using
the primitive interface to the virtual environment, she could keep going for a long
time.

These observations indicate that we need to work harder to design meaningful 
soma-based physical rehabilitation experiences. We also need to complete the entire 
sound design chain, as well as incorporate other modalities. In addition, opening 
yourself towards somaesthetic experiences and bodily reflections requires a certain 
internal will to do so. Similar notions were observed in SWAY, whose users had a 
hard time reflecting on their felt experiences. As such, one would agree with [12], 
that creating designs which quietly cater towards enabling such reflection is rather 
hard to achieve. 


7.5 Conclusions and Future Work 


This chapter highlighted three themes of soma design that can be useful for designing 
sonic interaction in virtual environments: Making space, intimate correspondence, 
and subtle guidance. These elements should be trained by the designers first, then 
introduced to users. The first ideation workshop describes how they are trained by 
the designers, and the therapy case study illustrates how they are introduced to the 
users. 


• Making Space: Allow your users to be in a dedicated physical or virtual space,
  slow down time, and facilitate inner sensorial tuning and reflection.

• Intimate Correspondence: Facilitate and embrace the feedback loops.

• Subtle Guidance: Externalize attention subtly, and try to keep it on movement as
  much as possible.

Perspectives and defamiliarization should frame all these elements. We invite
sound designers to try soma-based approaches and reflect on their design sessions 
regularly and actively. One way of doing this is using body maps before and after 
the design sessions. We regret this was not the case in the case studies reported here, 
but we will include them in the future. 

Body maps are simple sketches of body contour, used to recognize, visualize, and 
reflect on all three elements of soma design outlined above. Besides its ubiquitous use 
in soma design, body mapping currently informs research projects with populations 
marginalized by disability, mental health status, and other vulnerable identities [6], 
enabling diverse technologies such as wearables, virtual reality, and web-based tech- 
nologies. The approach can also have a significant impact on sound design, from 
externalization of sound sources to participatory sense-making in dynamic sound- 
scapes. We plan to implement the three bridges of the ideation workshop in VR, 
together with the body maps and soma sound design principles. 

Finally, the therapeutic applications of soma-based sound design should be further 
developed. While somaesthetics has a rich relation to therapeutic movement correction
through defamiliarization, soma design is yet to embrace this direction with
technological interventions. We hope to contribute to this line of research by re- 
implementing the ideas and soma-based methods in exemplars such as the Slow 
Floor and using body maps as a reflection tool in our own research. 


Chapter 8
Supporting Sonic Interaction in Creative,
Shared Virtual Environments


Liang Men and Nick Bryan-Kinns 


Abstract This chapter examines user experience design for collaborative music 
making in shared virtual environments (SVEs). Whilst SVEs have been extensively 
researched for many application domains including education, entertainment, work 
and training, there is limited research on the creative aspects. This results in many 
unanswered design questions such as how to design the user experience without 
being detrimental to the creative output, and how to design spatial configurations to 
support both individual creativity and collaboration. Here, we explore multi-modal 
approaches to supporting creativity in collaborative music making in SVEs. We 
outline an SVE, LeMo, which allows two people to create music collaboratively. 
We then present two studies. The first explores how free-form visual 3D annotations
instead of spoken communication can support collaborative composition processes
and human-human interaction. Five classes of use of annotation were identified
in the study, three of which are particularly relevant to the future design of sonic
interactions in virtual environments. The second study used a modified version of
LeMo to test how two different spatial audio settings support creative collaboration;
according to the results, the settings changed participants’ behaviour and affected
their collaboration. Finally, design implications for the auditory design of SVEs
focusing on supporting creative collaboration are given.


8.1 Introduction


Music has long been produced in social and collaborative ways [16, 67]. Being
inherently multi-modal, music making includes not only the produced sound itself but
also other presentations such as body posture [25], physical activation of the
instrument [7], and written symbols and sketches [40, 66] to manage the joint creation and
production of music. Many of these modalities such as body position are promoted by
the physical proximity of musicians. Immersive virtual environments (VEs) provide 
a great opportunity to mimic these multi-modal experiences and to explore radical 
sonic interaction design spaces for collaborative music making (CMM) [17, 70], 
such as telepresence for networked performance and composition. Indeed, whilst 
many screen-based collaborative systems treat users as outsiders looking in [3], VEs 
offer an opportunity to truly immerse people into interactions. Compared to tra- 
ditional media, VEs may provide a greater sense of community and more intuitive 
interactions [68], and offer new forms of human-computer interaction [36] and inter- 
personal interaction [34]. Furthermore, VEs have some unique advantages over other 
media to simulate multi-modal senses and enable people to interact in a natural way 
that is similar to the real world. 

However, although VEs have become a hot topic and have been researched in depth,
and the potential of multi-user immersive virtual reality to promote social activities
has been well established (see AltspaceVR¹ and Venues from Oculus²), little attention
is paid to interpersonal interactions in creativity, which include collaborative sonic
interactions, e.g. CMM. This raises many open research questions on how to design
user experiences in VEs to support collaborative sonic interactions, such as CMM.
In this chapter we will explore two design features of SVEs, trying to understand 
their roles in supporting collaborative sonic interaction: i) visual annotation and ii) 
acoustic attenuation. 

We will start by reviewing work in related areas. Then two studies will
be presented, with each exploring one of the two features. Finally, the findings of 
the two studies will be compared and implications for supporting collaborative sonic 
interaction in SVEs will be proposed. 


8.2 Shared Virtual Environments 


The term VE can be traced back to the early 1990s [12], and it emerged as a competing
term to virtual reality (VR). The two are usually used interchangeably to refer to a
world created entirely by computer simulation [32]. In the mid-1990s, the development
of network technology made it possible to connect many users in the same
VE, giving rise to shared virtual environments (SVEs) [53]. In addition to “SVEs”,
other similar terms in use include multi-user virtual environments, multi-user
virtual reality [18], collaborative virtual environments (CVEs) [75] and social virtual
reality (SVR) [19]. To stay consistent, we will herein use the term SVEs to refer to
VE systems in which users experience other participants as being mutually present
in the same environment and can interact inter-personally [53]. Whilst single-person
VEs concern how to create detailed (visual) simulations, the design of SVEs usually
prioritises enabling collaboration between users [41]. By providing a natural
medium for three-dimensional collaborative work [6] and allowing multiple people
to interact with each other, SVEs are considered emerging tools for a variety
of purposes, including community activities [31], online education [51], distributed
work and training [42], and gaming and entertainment [45, 47]. Despite this, there
is little research on supporting collaborative creativity (such as CMM). It is therefore
necessary to explore the design space for supporting the rich forms of interpersonal
interaction inherent in CMM, and many questions remain open: it is unclear whether
collaborative creativity in SVEs follows a pattern similar to real-world collaborative
creativity, and how to design virtual environments that support creative
collaboration, see [2]. For further discussions on these issues,
refer also to Chap. 6.

1 AltspaceVR: https://altvr.com.
2 Venues: https://www.oculus.com/experiences/quest/3002729676463989/


8.2.1 Embodiment in Collaborative Virtual Environments 


Our bodies provide continuous and immediate information about our presence, activ- 
ity, attention, availability, mood, status, location, identity, capabilities and many 
other factors to ourselves and others, hence using body language explicitly to facil- 
itate communication is recommended [3]. Questions have been voiced in regard to 
embodiment, including the impact of embodiment on users’ social communication 
and behaviour [68], how the avatars’ appearances and behaviours impact users’ sense 
of presence [20, 38, 57, 64] and co-presence [48, 59]. Research suggests that the 
embodiment plays an important role in conveying presence, location and identity 
[3, 4], all of which are crucial to the success of collaboration [16, 21]. Social inter- 
actions in the real world and in virtual environments are regulated by the same social 
norms [73]. An appropriate use of embodiment can enhance the sense of telepres- 
ence [43], the sense of social presence (the feeling that others are present with the 
user in the mediated environment) [3, 43] and promote the sense of community [52]. 
Embodiment also helps users gain a better sense of co-workers’ locations,
actions and intentions, and helps construct workspace awareness, see [24]. The
embodiment can also create a strong sense of identification, which is essential in 
collaboration since it is a fundamental component in creating workspace aware- 
ness [24], and it can influence collaboration both positively and negatively in group 
work situations [21]. Mutually engaging interactions can be significantly increased 
with proper awareness of the identity of others [16], and in VEs, to a large extent,
the identification is shaped by the embodiment. As a result, embodiment decisions 
are critical and can influence the quality and scope of collaboration in VR [68]. The 
avatar might be as basic as a T-shape with eyes to indicate orientation and viewing
direction, or as sophisticated as a full 3D body scan of the user [58]. 


8.2.2 Collaborative Music Making 


As previously discussed, music making, as a collaborative activity that relies on com- 
mon goals, understanding and good interpersonal communication, has long been a 
key form of collaborative creativity (cf. [16, 67]). Although music making tools for 
multiple users have become more and more popular with the aid of digital technolo- 
gies, this field remains fairly unexplored [29]. In 2003, Blaine and Fels [8] explored 
the design criteria of CMM systems and pointed out the main features including the 
media used, player interaction, learning curve of systems, physical interfaces and 
so on. In the same year, inspired by Rodden’s Classification Space for collaborative 
software [49], otherwise known as groupware, Barbosa developed the Networked 
Music Systems Classification Space [1], which classifies CMM systems in terms of 
the time dimension (synchronous/asynchronous) and space dimension (remote/co- 
located). Examples based on tangible user interfaces include reacTable, where mul- 
tiple users can construct and play the instrument by moving the tangible objects on 
the table [29], and Jam-O-Drum [9], which enables participants to join collaborative, 
musical improvisation. The Music Room provides a room-scale experience, allow- 
ing people without music expertise to compose original music inside an interactive 
space [39]. Sync’n’Move enables two users to explore a multi-channel pre-recorded
music piece, and users can generate audio content by synchronising their movements
using mobile phones as a collaborative interface. Another phone-based system
is Daisyphone [13], which provides shared editing of short musical loops. Other
examples include BilliArT [11], which offers a co-located music-making experience,
and Ocarina [69], which provides a distributed experience. Though many CMM systems
have been developed, most of them rely on users being in a relatively fixed position,
e.g. in front of a computer [72]. Potentially, the head tracking and spatialised audio
provided by VEs can be applied to break this chain and free users. However, this
research area is little explored, especially for the collaborative aspect.


8.3 LeMo: An SVE Supporting CMM 


To build a basis for exploring CMM in SVEs, we created Let’s Move (LeMo³),
which enables two users to manipulate virtual music interfaces together in an SVE
to create a music loop, see Fig. 8.1. LeMo was programmed in Unity, and models
and textures were made in Cinema 4D and Adobe Photoshop, respectively. The run-time
environment includes two HTC Vive headsets (each with one Leap Motion
mounted, see Fig. 8.1c) and two PCs connected and synchronised via a LAN cable.

3 More information is available at:
https://sites.google.com/view/liangmen/projects/LeMo

Fig. 8.1 LeMos enable two players to work together on a music loop in VR (reproduced from [36]
and [34])

LeMo currently has two major versions: LeMo I and LeMo II (together referred to
as LeMos). Both LeMos have three key elements:


• Music interface: For producing music. As shown in Fig. 8.2, the matrix interface
  contains a matrix of cells (grids or dots). Each row represents one pitch, and the
  rows form an octave from bottom to top. Users can edit notes by tapping the cells.
  A vertical play-head repeatedly moves from left to right, playing the corresponding
  activated notes. In this way, each interface generates a music loop.
• Avatars: Each user has an avatar, including a head and both hands, see Fig. 8.1.
  Avatars are synchronised with users’ real movements in real time, including the
  position and rotation of heads, as well as gestures. Users can therefore see not
  only their own embodiment but also their collaborator’s.
• Virtual space: A space in which users are co-present. LeMos provide visual aids for
  collaboration by synchronising the virtual environment (virtual space and music
  interfaces) and avatars across a network, giving participants the sense of being in
  the same virtual environment and manipulating the same set of interfaces.
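
The behaviour of the music interface (cells toggled by tapping, a play-head sweeping the columns and wrapping around to form a loop) can be sketched in a few lines. This is only an illustrative Python model, not LeMo's Unity implementation: the class name `MatrixInterface`, its methods and the seven pitch labels are assumptions made for the example, and no audio is produced.

```python
from dataclasses import dataclass, field

# Hypothetical pitch labels; the chapter only states that the rows form an octave.
PITCHES = ["C", "D", "E", "F", "G", "A", "B"]


@dataclass
class MatrixInterface:
    """A grid of toggleable notes: one row per pitch, one column per beat."""

    beats: int = 8
    pitches: list = field(default_factory=lambda: list(PITCHES))
    active: set = field(default_factory=set)  # (beat, pitch) cells that are switched on

    def tap(self, beat: int, pitch: str) -> None:
        """Tapping a cell toggles the note on or off."""
        if not 0 <= beat < self.beats or pitch not in self.pitches:
            raise ValueError("cell outside the matrix")
        self.active ^= {(beat, pitch)}

    def notes_at(self, beat: int) -> list:
        """Pitches sounded when the play-head reaches a column (wraps, so the loop repeats)."""
        column = beat % self.beats
        return [p for p in self.pitches if (column, p) in self.active]

    def play(self, steps: int) -> list:
        """Advance the play-head `steps` times and collect the notes hit at each step."""
        return [self.notes_at(step) for step in range(steps)]
```

For instance, after tapping (0, "C"), (0, "E") and (4, "G") on an 8-beat matrix, `play(9)` yields the C and E pair on the first step, G on the fifth, and the C and E pair again on the ninth step, because the play-head has wrapped around to the first column.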


LeMo I and II have three major differences, which arise mainly because LeMo II
was built later on the basis of LeMo I, and thus provides more and possibly better
functionalities. These differences are:


• The interface matrix of LeMo I is 8 × 7, while that of LeMo II is 16 × 8, so
  participants can create an 8-beat loop in LeMo I and 16-beat loops in
  LeMo II, see Fig. 8.2.

• While LeMo I only provides one stationary music interface, LeMo II allows users
  to generate, remove, position and edit up to eight virtual music interfaces. Music
  interfaces in LeMo II have two modes: sphere and matrix (Fig. 8.3b), with sphere
  mainly for storage and positioning, and matrix for music editing. Users can generate
  spheres with a pinch-and-stretch gesture, see Fig. 8.3a. The interface can be switched
  between the sphere and matrix forms using the pop button at the central bottom of the
  interface, see Fig. 8.3b. Users can have up to eight music interfaces at the same
  time,⁴ which means they can have at most eight music loops at the same
  time.

• Compared with LeMo I, LeMo II allows users to control more music features:
  users can now use sliders to control tempo, volume and pitch, and use an “erase”
  button and a “switch” button to erase or to switch among four different instruments
  (piano, drum, marimba and guitar), see bottom part of Figs. 8.2 and 8.3b.

Fig. 8.2 The interfaces of LeMo I and LeMo II

Fig. 8.3 a The gesture to generate a new interface; b Matrix (opened interface) and sphere (packed
interface), double-click the pop button to switch in between (reproduced from [34])
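
The interface rules described for LeMo II (at most eight interfaces, each either a packed sphere or an opened matrix, toggled with the pop button) can be summarised as a small state model. The following Python sketch is an assumption-laden illustration rather than the authors' code: `Workspace`, `Mode` and the method names are hypothetical, and the chapter does not say how LeMo II responds when a ninth interface is requested, so the sketch simply returns `None`.

```python
from enum import Enum

MAX_INTERFACES = 8  # cap stated in the chapter (chosen there to keep a proper frame rate)


class Mode(Enum):
    SPHERE = "sphere"  # packed form, mainly for storage and positioning
    MATRIX = "matrix"  # opened form, for music editing


class Workspace:
    """Hypothetical container for the music interfaces of one shared session."""

    def __init__(self):
        self.modes = {}  # interface id -> Mode
        self._next_id = 0

    def generate(self):
        """Pinch-and-stretch gesture: create a new interface as a sphere, if allowed."""
        if len(self.modes) >= MAX_INTERFACES:
            return None  # assumption: the request is refused once the cap is reached
        new_id = self._next_id
        self._next_id += 1
        self.modes[new_id] = Mode.SPHERE
        return new_id

    def pop(self, interface_id):
        """Pop button: switch an interface between its sphere and matrix forms."""
        current = self.modes[interface_id]
        self.modes[interface_id] = Mode.MATRIX if current is Mode.SPHERE else Mode.SPHERE
        return self.modes[interface_id]

    def remove(self, interface_id):
        """Remove an interface, freeing a slot under the cap."""
        del self.modes[interface_id]
```

Keeping the cap in a single constant makes the frame-rate trade-off explicit: a different limit would only require changing `MAX_INTERFACES`.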


8.4 Study I—Visual Approach: 3D Annotation 


Writing and sketching are often used in collaboration to exchange ideas, acting as 
a memory aid, conveying approval, ideas, doubts and so on. In the CMM systems 
Daisyphone and Daisyfield in [14], people are given a shared annotation mechanism, 
which enables collaborators to draw lines that are publicly visible. This has been 
suggested as an advantage to music making. Taking inspiration from this, the goal of
this study is to explore how similar visual cues (e.g. 3D annotations) might affect
creative collaboration in a VR setting. We are interested in exploring
how this capability may be used in an SVE to support collaborative sonic interactions
(CMM in this case).

To explore this, LeMo I enables users to draw 3D lines (annotations) by pinching
their thumb and index finger together and moving their hands, see left part of Fig. 8.4.
These 3D lines are shared and visible to both collaborators, and can therefore potentially
be used for communication. To avoid clutter or confusion, users can flip both
hands downward to discard all the 3D lines. Users can add or discard lines at any
time as they wish.

4 We limit the number to 8 to achieve a proper frame rate.

Fig. 8.4 Panels: before highlighting, after highlighting. All annotations in subsequent figures have
been emphasised by darkening the background and brightening the annotation lines to enhance
their legibility outside of VR (from [33])


8.4.1 Participants and Procedure 


Thirty-two participants (16 pairs) were recruited for this study via group emails at the
authors’ university and the authors’ social media.⁵ Of the participants, 25% had not
used VR before, 37.5% of them had tried it only once, nearly a third (28.5%) of them
had played 2-5 times and nearly 10% played VR frequently. Only two rated themselves
as music experts, with the majority rating themselves as novices in the musical field.
Twelve pairs of participants were familiar with their study partner prior to the study.
It took each pair of participants roughly 1 h to finish the experiment; participants
received no compensation.

After reading and signing informed consent forms, each pair of participants first
received a tutorial on how to use LeMo I and then undertook a task-free trial of LeMo
I for 5 min, during which they could change music notes and make annotations,
helping them get familiar with LeMo. After that, each pair undertook four sessions
of composing music, each lasting 5 min. They were asked to create together a music
loop that sounded nice to them. Note that only two of these sessions, those in which
participants could make annotations, were part of this study. Participants’ annotations
were recorded and are highlighted for better readability, see an example in Fig. 8.4. The
study ended with a semi-structured interview (around 5 min). Although participants
were physically co-located during the experiment, we purposefully neither supported nor
allowed spoken communication. This is because the creative content is in the sound
domain and we are interested in how to design systems which foreground the creative
uses of sound whilst using complementary modalities to manage the creative process.


5 The Queen Mary Research Ethics Committee granted ethical approval to carry out the study within 
its facilities (Ethical Application Ref: QMREC1592). 

Fig. 8.5 Presence annotation: “XiaoB” (a) and “it me” (b), from [33]


8.4.2 Annotation Categories 


Seventy-eight annotations were post-hoc identified and categorised by the researchers 
according to the annotations for Mutual Engagement classification scheme (referred 
to as aME classification) in distributed music making: presence, making it happen, 
quality, social and localisation [14]. We use the aME classification scheme as a starting
point for understanding the use of annotations in LeMo I. The following subsections 
report on the kinds of annotations participants used when making music together in 
LeMo I, and later sections reflect on these annotations and the utility of the aME 
classification scheme for SVEs. 


8.4.2.1 Presence 


The concept of presence has been defined and interpreted in different ways, e.g. [26, 
62, 63, 71]. Presence is a subjective experience [26, 61] which can greatly affect 
collaboration [22, 50]—having knowledge of oneself and those we are working with 
is important in collaboration. An earlier study found many participants in distributed 
music making used annotations as a way to express and query presence, helping 
participants know about each other’s existence [14]. In this study only two users
used annotations to convey presence. One wrote “XiaoB” (the participant’s name)
and the other wrote “it me” to tell their collaborators of their presence and identity, see
Fig. 8.5. The reason that far fewer people used annotations to convey presence
could be that the avatars provided a sense of presence and identity not available in
the original Daisy studies in [14]. Avatars intuitively show the collaborators where
they are, what they are doing and where they are looking. Another reason might be
that the collaborators were co-located and had previously met in the
real world before entering the virtual realm.


8.4.2.2 Making It Happen 


Annotations were also used to support the process of collaborative music making in 
four ways explored below: 

Fig. 8.6 Turn taking annotations: “You go ahead” (a); “you make” (b); “I make” written in Chinese
(c); “you do” (d) (reproduced from [33])


• Turn Taking: Although LeMo I allows simultaneous editing of the shared musical
  loop, at some points participants took turns to contribute musical notes and
  used annotations to manage the process. As shown in Fig. 8.6, participants wrote
  “Let me” or “you do” to switch who had the active role. By doing so, the active
  person could either request or give away full control of the music interface until
  they agreed to a turn change. Note that there was no explicit ownership control
  of the musical interface, so in these cases participants were self-managing their
  access to the shared musical loop.

• Composition Thoughts: Some annotations expressed composition ideas at different
  levels: the highest level (music style), the medium level (patterns formed of
  notes), and the most specific level (single notes). The lines in Fig. 8.7b, c, d, e,
  drawn to align with possible notes on the grid, sketch out participants’ composition
  ideas. These are more specific communications than annotations revealing general
  musical ideas (e.g. “Chinese style?” in Fig. 8.7a).
  These annotations were usually drawn before activating the corresponding buttons
  to make and share a plan, possibly so that the partner could help to construct
  the sequence of notes. Occasionally, these compositional sketches were drawn
  afterwards (e.g. Fig. 8.7e) and were used to demonstrate a musical idea. In both
  cases, this kind of annotation may have helped participants to better formulate and
  understand the collaborative music plan/idea. More directed use of annotations in
  composition is illustrated in Fig. 8.7f, where the participant made three dot markers
  near the column reference system (B, G and D specifically), asking the partner
  to make notes in these three columns, which resulted in the partner adding these
  notes to the shared musical loop. A similar case is shown in Fig. 8.7h, in which the
  partner was asked to make notes in rows C, E and G. Participants also directly
  wrote the reference to ask partners to change specific notes, see Fig. 8.7h, i, j, k.

• Area and Position Arrangement: Annotations were also used to divide the working
  area and to manage participants’ work focus in the VE. Fig. 8.8a shows an example
  in which participants drew a horizontal line to divide the music interface into two
  parts, one for each participant. After the line was drawn, each of the pair composed
  within their own working area, and later on the word “Switch” was written to ask to
  switch positions (i.e. to swap from top to bottom and vice versa), see Fig. 8.8b.
  These annotations may have contributed to the management of participants’ working
  areas and space.

• Confusion Expressions: Participants used annotations to write “what” or to draw
  a question mark, presumably to express confusion about their partners’ activities,
  given that such annotations were made directly after their partners changed notes,
  drew, wrote or made gestures. Fig. 8.10 illustrates typical indicators of confusion.

Fig. 8.7 “Chinese style?” written in Chinese (a); Patterns formed of notes (b, c, d, e); Note markers
(f); References of notes (g, h, i, j, k) (from [33])

Fig. 8.8 Annotations for working area arrangement (from [33])


8.4.2.3 Quality 


When creating the music loop, reflecting on and exchanging ideas about the quality
of the piece is crucial to smooth cooperation and to ensure a final output of
good quality. In LeMo I, participants used annotations to express and exchange
their judgments of the quality. These annotations are usually short words or simple
shapes, either positive (e.g. “OK”, “Nice”, “Cool”, “Good” and heart shape) or
negative (e.g. “No”), as illustrated in Fig. 8.9. Some of the confusion expressions
such as “?” were probably indicators of queries of quality, not just queries about
the process. It is also interesting to note that positive words may convey different
meanings depending on when they are written. For example, a “yes” written shortly
after a note addition signals the writer’s satisfaction with that addition, while an “OK”
written much later has less to do with any particular addition and signals
satisfaction with the whole piece. These emerging annotation-based
judgments help collaborators exchange feelings about the piece being made, reduce
variation in ideas and strengthen cooperation on the activity.

Fig. 8.10 Confusion annotations (from [33])


8.4.2.4 Social 


Beyond music making and process management, annotations were also used for
non-task-related purposes, as illustrated in Figs. 8.11 and 8.12. As shown in the
detailed steps in Fig. 8.12, one participant started a social drawing activity; their
partner then saw this and joined in, and they finished the drawing together.
It is interesting to note that in total five human doodles appeared, two of which were
drawn collaboratively. Possible reasons for their frequent emergence could be that
participants were unknowingly inspired by the kinetic avatar, or that people just
naturally love to draw faces. Although social annotations did not contribute to the music
directly, making these lighthearted drawings, as a social interaction, contributes to a
close relationship between the collaborators.

Fig. 8.12 Annotations for social purposes (reproduced from [33]). Panel labels: Participant C
started with eyes and mouth; drew the face contour; C wrote D’s name “Gabana”; D drew a
curly arrow pointing to D’s name, and added hair


8.4.2.5 Localisation 


Bryan-Kinns [14] identified the frequent use of annotation as a localisation cue
(mainly by drawing arrows), but in LeMo I we only found one similar case, in which
the participant drew an arrow and, judging from a review of the interaction, successfully
obtained their partner’s attention, as illustrated in Fig. 8.13. However, in this case
the arrow may have been more to attract attention to the activity rather than to
highlight a specific part of the joint creation. The reason that annotations were rarely
used for localisation in LeMo I could be that participants could simply draw each
other’s attention to a certain location by waving their hands and then pointing to that
location.

Fig. 8.13 A participant drew an arrow (a), and this successfully drew their partner’s attention to
the intended area (b) (from [33])


8.4.3 Interviews 


Post-task interviews with participants revealed more reflective insights into the use 
of the annotations. The interviews were transcribed (around 5,000 words) and a 
thematic analysis was undertaken, see more information about thematic analysis 
in [10, 74]. The thematic analysis started with a reading through of the transcript, 
then an inductive analysis of the data was performed, and relevant patterns were 
collapsed into codes. Next, these codes were combined into overarching themes, 
which were then reviewed and adjusted until they were appropriate for the codes. 
In total, 41 codes and 4 overarching themes emerged from the thematic analysis. 
Two themes were directly related to annotation: (i) annotation’s usefulness, and (ii) 
annotation’s problems. 

Many participants described that they had a positive feeling when they could write 
something to support their communication. They reported annotations were used to 
make “signs and symbols” to support composition, or to “create drawing together 
[...] like a physical warm up”. Participants also reported that annotations exceeded 
vocal communication in some ways, “with the lines, [they] could just circle the notes 
to say that was [note] G and go back to [note] C, from that perspective, drawing was 
more effective”. Many participants reported that they successfully understood each 
other’s intentions via the annotations, e.g. one participant drew a line and “used the 
line to affect the partner”, guiding their partner to move notes to lower positions, 
the partner fully understood and reported they “did the changes”. Other examples 
mentioned are showing satisfaction by “writing an OK” or using “Hi” for greetings. 

Meanwhile, writing and reading in 3D space were reported by participants to be 
quite different from the real world and these differences caused inconveniences and 
problems. For instance, the 3D nature of the annotations reduced their readability, 
it only “makes sense to [them] from [their] perspective[s], because it was 3D”. For 
ease of identifying the annotations, “[they] need to stand where the person wrote it 
stood”. Furthermore, making annotations was reported as being time-consuming, and 
“when [they] finish[ed] it, it [did] not make sense” anymore. Also, the low accuracy 
of movement tracking led to annotations being drawn at quite a large size, which 
then led to a limitation of “how much [they] [could] write”. Finally, participants 
reported that it was hard to notice each other’s annotation activities: one participant
“waved hands to [their partner], but [the partner] did not see”, so the participant “had to
wave hands [closer], directly in front of [the partner]” to draw their attention to the
annotations so as to get the annotations read. This was probably due to the narrower
field of view (FOV) in VR compared with real life: the FOV is about 100 horizontal degrees
with the HTC Vive versus about 200 degrees binocular FOV in real life, see [28, 30].


8.4.4 Reflection of Study I 


Similar to Bryan-Kinns’ findings [14], most of the annotations that emerged in the 
use of LeMo I fall into three types: making it happen, quality and social. However, 
unlike the aME classification, presence and localisation appear to be well managed 
through avatar interaction. This similarity suggests that 3D annotations can function 
similarly in an immersive collaborative music-making system as they can in a 2D 
non-immersive CMM system. However, far fewer annotations were used to convey
presence compared with the findings of Bryan-Kinns [14], which may be because
avatars already support this well, or because participants were physically
co-located with LeMo I whereas the Daisy* studies were distributed
remotely. The length of the musical loop in LeMo I is 8 beats, whereas
in the Daisy* studies the length was 48 beats which may have affected the kinds 
of annotation produced as the LeMo I loop was simpler and required less temporal 
organisation. Regardless of these issues, the use of aME to classify annotations in a 
study of CMM indicates that the annotation classification scheme applies to media 
beyond the Daisy* systems it was previously used to evaluate [14]. 

For sonic interaction design of VEs, the findings of this exploratory study indicate
that 3D graphical annotations in a virtual environment can support music making
as a tool for communication where the co-produced sound is prioritised over other
modalities (CMM in our case). We specifically prevented conversation during the
creative process to allow us to explore how to support collaboration without interrupt- 
ing or interfering with the music being created by collaborators. The step sequencer 
used in LeMo I was intentionally simple to allow initial exploration of the role of 
annotations without conflating this with the complexity of an interface. For richer 
and more complex sonic creation and exploration in VR, we suggest that annotations 
could usefully support communication about the process, quality and also social 
aspects of interaction without compromising the joint product being produced. This
may foreground the creative sound product to such an extent that
the sounds created are able to use the full width of the sound domain to the exclusion
of all other parts of the human-human interaction necessary for collaboration.

Whilst the annotations of LeMo I supported co-creation of music, they did gen- 
erate some issues. More specifically, making annotations and viewing them were 
reported to be very different from real life, daily experiences. Participants needed 
to get used to controlling strokes by pinching and releasing fingers. Besides, compared
with writing or drawing with a real pen, LeMo I supports these actions less
accurately. To increase the readability of written content and sketches, participants
tended to write or draw in a bigger size, which resulted in a limitation of
how much they could write/draw. But on the positive side, the larger size made it
possible to write and draw together, which expanded the range of annotating action, 
making it less personal but more social-friendly and more accommodating to multi- 
ple people. Another unexpected problem found in this study was that 3D annotations 
can, of course, be viewed from many angles, so written text is often reversed for a 
participant’s collaborator, especially if they write in the space between themselves. 
This clearly decreases the readability of the annotations. Some participants wrote in 
reverse to try to compensate for this issue, see an example shown in Fig. 8.9h and i. 
Future development of the use of annotations in VR would need to explore how this 
mirroring issue could be addressed. 


8.5 Study II – Audio Approach: Augmented Acoustic
Attenuation 


Sound attenuates as a result of diminishing intensity when travelling through a 
medium. Acoustic attenuation is one of the primary cues for sound localisation of 
distance; it enables humans to use their innate spatial abilities to retrieve and localise 
information and to aid performance, see [5]. Whilst augmenting the acoustic attenua- 
tion of a real medium (e.g. the air) is difficult, this can be easily done in VEs with the 
aid of audio simulation (refer to Chap. 3 for modularity in the auralisation). Research 
has begun to investigate the impacts of spatialised sounds on user experience in VR, 
see [27]. However, little research explores how the spatialisation of sound may affect 
or aid collaboration in a VR context. Considering sound is both the primary medium 
and the final output of the creative task [34], by affecting sound, different settings of 
acoustic attenuation can possibly affect the collaboration differently. With the ability 
to modify the simulated acoustic attenuation in an immersive virtual environment, 
we can possibly create sonic privacy by augmenting acoustic attenuation, and then 
use sonic privacy as personal space to support individual creativity in CMM. Sup- 
porting individual creativity is important as it contributes to the group creativity [37, 
44, 46, 60]. 
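
As a rough illustration of what augmenting the attenuation means, the sketch below uses the common inverse-distance gain model (the same form as the "inverse" distance model in OpenAL and the Web Audio API). The chapter does not state which attenuation curve LeMo II uses, so the function names, the 1 m reference distance and the rolloff value of 4.0 are assumptions chosen only for illustration.

```python
import math


def distance_gain(distance_m, ref_distance_m=1.0, rolloff=1.0):
    """Linear gain of a source heard from distance_m metres away.

    Inverse-distance model: rolloff=1.0 approximates free-field behaviour
    (about -6 dB per doubling of distance); a larger rolloff makes the
    level fall off faster, i.e. an "augmented" attenuation.
    """
    d = max(distance_m, ref_distance_m)  # no boost inside the reference distance
    return ref_distance_m / (ref_distance_m + rolloff * (d - ref_distance_m))


def to_db(gain):
    """Convert a linear amplitude gain to decibels."""
    return 20.0 * math.log10(gain)


if __name__ == "__main__":
    # Hypothetical settings: natural (Cpub-like) vs augmented (Caug-like).
    for d in (1.0, 2.0, 4.0):
        natural = to_db(distance_gain(d, rolloff=1.0))
        augmented = to_db(distance_gain(d, rolloff=4.0))
        print(f"{d:.0f} m: natural {natural:6.1f} dB, augmented {augmented:6.1f} dB")
```

With these assumed values, a source 2 m away is attenuated by about 6 dB under the natural setting and about 14 dB under the augmented one, so a collaborator's interface fades more quickly as one moves away from it.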


8.5.1 Hypotheses 


Research has suggested users should be allowed to work individually in their per- 
sonal spaces at their own pace, cooperatively work together in the shared space 
and smoothly transition between both of the spaces during collaboration [23, 56, 
65]. In a previous study [34], following this implication, we built three different 
spatial configurations (public space only, public space + publicly visible personal 
space, public space + publicly invisible personal space), and tested the impacts
of these spatial configurations on collaborative music making in SVEs. The results
show that adding personal space is helpful in supporting collaborative music making
in SVEs, since it provides a chance to explore individual ideas, and provides higher
efficiency in making notes. However, several negative impacts also showed up along
with the addition of personal space, e.g. longer average distance between participants,
reduced group territory and group edits [34]. We believe this might be due to: (i) the
separated stationary locations of the personal spaces, which forced users to leave each other
to access them, causing a longer distance between participants and less collaboration; (ii)
the rigid boundary between public space and personal space, which made users more
isolated, resulting in a higher sense of isolation. Thus allowing users to access personal
space without moving far away from each other might eliminate these disadvantages.

To make the shift between personal and public spaces more fluid, inspired by the 
implication that the separation between public and personal workspace should be 
gradual rather than too rigid [23], the attenuation feature can possibly be applied to 
form a gradual personal space, enabling a fluid transition between personal space 
and public space. This is because sound is both the primary medium of collaborative 
tasks and the final work of CMM [33], thus by manipulating acoustic attenuation, 
we can produce sonic privacy. Thus H1 was developed. 

H1: Attenuation can play a similar role to personal space with rigid form in CMM 
in SVE, providing collaborators a personal space and supporting individual creativity 
during the collaboration. 

Additionally, an acoustic attenuation, rather than a personal space with rigid 
separation from public space, enables a gradual shift between personal and public 
workspace, which may possibly increase the fluidity of the experience and support 
collaboration better, cf. [23]. Thus we developed H2. 

H2: Acoustic attenuation provides a fluid transition (no hard borders or rigid
forms) between personal and public spaces, which introduces fewer negative impacts
on collaboration compared with the rigid-form personal space in [34].


8.5.2 Independent Variable 


Spatial configuration is an independent variable in this experiment. Two spatial con- 
figurations were designed as the independent variable levels, as shown in Fig. 8.14, 
including the following: 


• Condition 1: Public space only (referred to as Cpub): where players can generate,
  remove or manipulate music interfaces, and have equal access to all of the space
  and the music interfaces. As no personal space is provided, a shift between public
  and personal space does not exist, i.e. users cannot shift to personal space.

• Condition 2: Public space + Augmented Attenuation Personal Space (referred to
  as Caug). In addition to Cpub, the sound attenuation is augmented. The volume of
  audio drops much faster, creating a sonic privacy, which can be seen as a personal
  space. As the volume changes gradually with the changes of distance, the shift
  between personal space and public space is gradual.


Fig. 8.14 Top view of the two experimental condition settings


8.5.3 Dependent Variables 


To identify how users use the space and the effect of adding augmented sound atten- 
uation as personal space, dependent variables were developed. The Igroup Presence 
Questionnaire (IPQ) was used to inform the design of questions about sense of 
collaborator’s presence [54]. The IPQ measures the sense of presence using one 
general measurement—sense of being there, plus three sub-scales covering spa- 
tial presence, involvement and experience realism. Questions about output quality, 
communication and contribution were adapted from the Mutual Engagement Ques- 
tionnaire (MEQ) [15]. The MEQ is formed of two parts: (i) participant ratings of 
the quality of the musical outcome and their interaction with musical interface; (ii) 
participant choices between different conditions when being provided a series of 
statements covering the music quality, enjoyment, involvement and frustration. The 
rest of the questions were designed to question people’s preference for conditions. 
The questionnaire included measures on: 


• Presence: (i) sense of self-presence, (ii) sense of co-worker’s presence and (iii)
  sense of collaborator’s activities.

• Communication: quality of communication, which may vary as the visibility of
  spaces can possibly affect the embodiment and nonverbal communication.

• Content assessment: the satisfaction of the final music created reflects the quality
  of collaboration, cf. [15, 16].

• Preference: preference of the conditions, to see if users have subjective preferences
  towards the settings.

• Contribution: (i) the feeling of self’s contribution; (ii) the feeling of others’
  contribution.


Table 8.1 Results of Post-Session Questionnaire and the results of Wilcoxon Rank-Sum Test (two-tailed)*

| Question (measure) | Cpub M | Cpub SD | Caug M | Caug SD | p | W |
|---|---|---|---|---|---|---|
| PSQ1 (support for creativity): I think the space setting in this session was extremely helpful for creativity | 8.55 | 1.44 | 8.77 | 1.34 | 0.5695 | 259 |
| PSQ2 (support for creativity): I feel like the space setting in this session was extremely helpful to support the development of my own ideas | 7.82 | 1.92 | 8.35 | 1.50 | 0.5211 | 255.5 |
| PSQ3 (preference): I enjoyed the space setting of this virtual world very much | 8.27 | 1.61 | 8.65 | 1.60 | 0.2622 | 233 |
| PSQ4 (sense of collaborator’s presence): I always had strong feeling that my collaborator was there, collaborating with me together, all the time | 8.91 | 0.92 | 8.54 | 1.68 | 0.7961 | 298.5 |
| PSQ5 (content assessments): How satisfied are you with the final piece of loop music you two created in this session | 8.64 | 1.09 | 8.50 | 1.36 | 0.7644 | 300.5 |
| PSQ6 (communication quality): How would you rate the quality of communication between you and your collaborator during the session | 8.68 | 1.09 | 8.50 | 1.36 | 0.7644 | 300.5 |
| PSQ7 (sense of collaborator’s activity): I had a clear sense what my collaborator was doing | 8.73 | 1.20 | 7.96 | 1.54 | 0.08094 | 368.5 |
| PSQ8 (amount of contribution): The amount of your contribution to the joint piece of music is | 8.41 | 1.44 | 8.15 | 1.46 | 0.4776 | 320 |
| PSQ9 (amount of contribution): The amount of your collaborator’s contribution to the joint piece of music is | 8.18 | 1.26 | 8.23 | 1.39 | 0.8486 | 276.5 |
| PSQ10 (quality of contribution): What do you think of the quality of your contribution to the joint piece of music is | 8.05 | 1.70 | 7.81 | 1.41 | 0.319 | 333.5 |
| PSQ11 (quality of contribution): What do you think of the quality of your collaborator’s contribution to the joint piece of music is | 7.73 | 1.52 | 8.19 | 1.20 | 0.3496 | 241.5 |

* Note: statistics in this table are calculated based on the data collected from third and fourth session
to better counterbalance the learning effect


These measures are grouped into a Post-Session Questionnaire (PSQ, see items 
in Table 8.1). 


8.5.4 Participants and Procedure 


Fifty-two participants (26 pairs) were recruited through group emails at the authors’
university for this study.⁶ Each participant was compensated 10 GBP for their time
(roughly 1 h). Participants’ mean self-rating of music theory knowledge was 3.92 (SD = 2.50)
on a 10-point Likert scale, where higher values indicate greater knowledge; 24
participants played one or more instruments, and the remaining 28 did not. Twenty
participants had tried VR 2-5 times before, 20 had tried it only once and the remaining 12
had no previous VR experience. Thirty-seven participants knew their collaborators
very well prior to the experiment; three had met their collaborators several times, and the
remaining 12 did not know their collaborators at all prior to the experiment.

⁶ The Queen Mary Research Ethics Committee granted ethical approval to carry out the study within
its facilities (Ethical Application Ref: QMREC2005).

The experiment started with participants reading the information form and signing 
the consent form. Then they first received an explanation of the music interface of 
LeMo II (see Fig. 8.2), with all of the interaction gestures supported in LeMo II 
demonstrated by an experimenter. Next, a trial session (roughly 5-15 min) was carried
out, where participants could try all of the possible interactions. The trial ended 
once participants were confident enough of all available interactions. The length of 
time of the tutorial session was flexible to ensure participants with diverse musical 
knowledge could grasp LeMo II. Participants were then asked to complete four sessions
of collaboratively composing music that was mutually satisfying and complements an
animation loop. Two of these sessions were set for this study; each covered a condition
(Cpub/Caug), and the sequence of conditions was fully randomised to counterbalance 
the learning effect. We set each session as 7 min because based on our pilot study 
and a previous study [33], we found 7-8 min were sufficient for the task. In total, 
four visual, silent animation loops were introduced to trigger participants’ creativity; 
each to be played in one experimental session on four virtual screens surrounding 
the virtual stage. These clips were played in an independently randomised sequence 
to counterbalance impacts on the study. Each session ended with a Post-Session 
Questionnaire (PSQ, see Table 8.1). After all the four sessions finished, a short 
interview was carried out. 


8.5.5 Results 


Wilcoxon Rank-Sum tests were run to compare the ratings of Cpub with Caug collected 
by PSQ, see results in Table 8.1. No significant effect was found between Caug and 
Cpub. Post-task interviews revealed more reflective insights. Around 41,000 words of 
audio recorded interview responses were transcribed and a thematic analysis of the 
transcription was undertaken (more details about the thematic analysis in Sect. 8.4.3). 
As shown in Fig. 8.15, in total, 439 coded segments, 15 codes and 3 overarching 
themes emerged from the thematic analysis: (i) learning effects; (ii) preferences, 
advantages and disadvantages of conditions; and (iii) advantages, disadvantages of 
LeMo II and suggestions for improvements. Next, we will only cover the former two 
themes as the final one is not directly related to the scope of this chapter. 
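
The rank-sum comparison behind Table 8.1 can be sketched as follows. This is a minimal standard-library version, not the authors' analysis script: the chapter does not say which software or which variant (exact or approximate) was used. The sketch assumes the normal approximation with a tie correction and no continuity correction, and it reports W as the rank sum of the first sample minus its minimum possible value (the convention used, for example, by R's `wilcox.test`). The ratings in the demo are invented.

```python
import math
from collections import Counter


def rank_sum_test(x, y):
    """Two-tailed Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (w, p), where w is the rank sum of x minus n_x * (n_x + 1) / 2.
    """
    n_x, n_y = len(x), len(y)
    pooled = sorted(list(x) + list(y))

    # Midranks: tied values share the average of the ranks they occupy.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    w = sum(rank_of[v] for v in x) - n_x * (n_x + 1) / 2

    n = n_x + n_y
    mean = n_x * n_y / 2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    var = n_x * n_y / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var <= 0:  # every observation identical: no evidence of a difference
        return w, 1.0

    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, p


if __name__ == "__main__":
    # Invented 10-point ratings, for illustration only.
    cpub = [9, 8, 10, 7, 9, 8, 9]
    caug = [8, 9, 9, 10, 7, 9, 8, 10]
    w, p = rank_sum_test(cpub, caug)
    print(f"W = {w}, p = {p:.4f}")
```

For samples as small as these, an exact test would normally be preferred; the approximation is used here only to keep the sketch short.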


8.5.5.1 Learning Effects 


Members of 18 groups mentioned the effect of the session sequence. Specifically, 43
coded segments contributed by 27 participants were related to learning effects. For
example, Participant 15A (participant A in group 15, referred to as P15A) reported the
sequence is an “important factor”. The first session was felt to be hard as they were
“just being introduced to [the system and they were] still adjusting” to it (P5A), trying
to “[figure] out how the system was working” (P16A). As they “were progressing into
latter sessions, [they] felt easier to communicate and use gestures to manipulate the
sound, being able to collaborate more, more used to the system” (P5B), and these changes
led to a higher level of satisfaction and more enjoyment in later conditions. To better
counterbalance the impact of sequence, Table 8.1 only includes data collected from
the latter two sessions (note: as aforementioned, there were four sessions that were
randomly sequenced, two of which were related to this study).

Fig. 8.15 Ingredients of all the coded segments of the interview; numbers of coded segments are
shown in the bars


8.5.5.2 Cpub: Simple but Can Be Chaotic


With no personal space, participants had to hear all the interfaces throughout the
session. In total, 16 coded segments are about the disadvantages of Cpub; some exemplars
are: “a bit troubling” (P11B),⁷ “music always very loud” (P9A), “it was global music,
and there was someone annoying” (P>?,), “you are not going to say anything” to avoid
being “rude” (P2,). It was easier if there is something helpful “to perceive what I
was doing, and not get confused with what [the collaborator] was doing” (P15B), it
was too “chaotic” (P20A), “too confusing” (P22A and P22B), “annoying” (P5B). They
“can not concentrate” (P25B) while “everything [is] open and quite noisy” (P26B), and
they “don’t have the tranquillity to operating [their] sounds or the everything’s come
mixed, which is difficult to manage” (P22,).

⁷ P11B refers to participant B in group 11; similarly, P9A indicates participant A in group 9.

There were 25 coded segments from 14 participants reporting the positive side
of Cpub; some examples are: (i) pieces created in “personal space” might clash
in a musical way (P1A), so it is “better to work when knowing how it sounds all together”
(P17B), and music pieces might match better; (ii) better for providing help to the other
collaborator, as reported by P4A, who said that they needed someone to lead them
and thus the ability to hear all the work all the time was helpful; (iii) “space wise”,
compared with having to work closer to “hear the sound well” (P12A) in Caug, Cpub
does not have this space constraint, so they could choose to work “anywhere” (P24A);
(iv) “easier” to understand the condition (Peg), with less confusion when simply being
able to hear all the things all the time (P13A); (v) “collaborative wise” (P13A), less
separation and better collaboration compared with when “personal space” was provided (P3B,
Piga and Pigg).


8.5.5.3 Preference for Caug


There were 35 coded segments contributed by 24 participants favouring condition
Caug, more than the 12 segments contributed by 11 participants favouring Cpub. There are 111
coded segments contributed by 33 participants from 25 groups reporting the advantages
of this condition, far more than the number of segments reporting other
conditions’ advantages. These reports reveal some insights behind the popularity of
Caug. Caug’s advantages reported by participants can be grouped into four groups:


• Higher team cohesiveness and lower sense of separation. Participants reported
  that, without the rigid personal space, they had to “work with the other person”
  (Pea). With no rigid personal space, Caug “forces [them] to collaborate more the
  most because [they] had to stay very close to compose music” (P9B).

• An appropriate environment for creativity, more consistency and convenience. As
  described by participants, it was “a middle point between personal space and no
  personal space” (Pea); without even triggering something, “[they] could decide
  in a continuous way if [they] were able to listen to the other sound sources or
  not”, and “to what extent [they] want to isolate [themselves]” (P16A). Compared
  with having to hear all sounds in Cpub, this provided them a “less stressing” (P4A)
  context, and they could selectively move away to avoid “getting interrupted with
  the other” (P5B) and overlapping music. Being able to still “hear a bit of it in the
  background but not completely” (P20A) was reported as good, as this kept them “up to
  date” (P9A) and helped them to “tailor what [the participant] was making” (P22B)
  to match the co-created music and to make something new and see if it “fit with”
  (P20A) the old. Caug provided them with “a little bit of personal space” although
  not quite a “defined thing” (Pea), which provided the possibility “to work on
  something individually” but also being able to “share work quite easily” (P29,).

• Easier to identify sounds. Participants reported it was easier to “locate the source
  of the sound” (P16A) and “perceive what [they were] doing” (P15B), which helped
  them “understand instruments better” (P7B) and “not get confused” (P15B).

• More real. Interestingly, instead of Cpub, which simulates the real-world sound
  attenuation, Caug was reported to be “real”. “If you want to hear something, you
  just come closer, like in the real world” (P11A and P11B), “it was good like we were
  feeling like the real-time experience” (P26B).


It should also be noted that, along with these reports about advantages, there are
19 segments reporting Caug’s limitations, including: (i) a preference “to hear all the
instruments all the time” in Cpub (P26B), (ii) Caug might lead to “another type of
compositions” and “influence the piece” (Pi¢g) and (iii) not being able to hear
all sounds led to a feeling of separation (Pjs,).


8.5.6 Discussion 


The issues from having no personal space are clear. Especially for the music-making
task in this study, participants reported that without personal space, the auditory
background could be too messy for developing their own ideas, and their creativity required a
quieter and more controllable environment, which could be provided by personal
space. Providing such an environment is crucial considering individual creativity is
an important part of collaborative creativity. Having personal space was reported
to be “an added advantage” because it promoted their own creativity, which can then
be combined and contributed to the joint piece. This matches the findings in [34]
that providing personal spaces is helpful as it provides a chance to explore individual
ideas freely, which then adds an interesting dynamic to the collaborative work.
However, adding personal space did have some impacts. Next we discuss the
impacts of using acoustic attenuation as personal space and its characteristics.


8.5.6.1 Impacts of Adding Acoustic Attenuation as Personal Space 


As mentioned above, in the previous study [34], we found the addition of personal 
space located on the opposite side of the public space led to a shrunken size of group 
territory, fewer group note edits, a larger size of personal territory, more personal 
note edits, a larger average distance between collaborators and fewer times of paying 
attention to collaborator. We argued that these negative impacts are mainly due to 
the personal spaces distributed on the opposite side of the group space resulting in a 
larger distance between participants. So we proposed that personal space with different
features (e.g. a gradual boundary, as in Caug) might reduce these negative effects. In many
ways, Caug is quite similar to Cpub, e.g. neither has a visual boundary for spaces,
so not surprisingly, no significant differences were found in most of the statistical
measures, see Table 8.1, and most previously identified disadvantages brought by
adding rigid personal spaces have been successfully eliminated; more detailed results
are available in [35]. By making the personal space invisible and gradual, the isolation
and difficulty of coordinating that were introduced by the additional personal space were
minimised. For example, in the interview, participants reported that Caug provided a proper
level of group work as working context, making it easier to create new material that matches
the old.


8.5.6.2 Providing Personal Space with Fluid Boundary 


Although no significant differences were found in PSQ2, see Table 8.1, which questioned
the support each condition gave to individual creativity, Caug has a higher
mean rating. The thematic analysis revealed more insights. Caug provides both “an
appropriate background” with which participants felt “less stressed” and were able to 
“tailor” the individual composing to match the co-work, and a space personal enough 
to “work on something individually”. No major differences were found between Cpub 
and Caug in PSQ, indicating Caug provides a very mild solution, with limited impacts 
on people’s collaborative behaviour introduced, whilst still providing sufficient sup- 
port for individual creativity during collaboration, thus H1 is validated. 

Compared with natural attenuation in Cpub, Caug’s augmented sound attenuation
setting forced or prompted people to work more closely in order to hear each other’s
work, as reported by some participants. Compared with adding personal space with a
visible rigid boundary, by enabling participants to “decide in a continuous way”
(P16A) if they want to hear others’ work, an invisible gradual boundary in Caug led
to less separation, and higher consistency between personal and public space. H2 is
therefore supported. This finding also echoes the implication proposed in [23] that
there should be many gradations between personal and public space to enable people
to shift fluidly in between.

Regarding popularity, the code “advantage of Caug” has 111 coded
segments, and the code “most favourite - Caug” has 35 coded segments; both are
greater than what Cpub gets. All indicate Caug is the most popular condition. The
popularity is also partially supported by the fact that Caug has the highest mean in the preference
measure (PSQ3 in Table 8.1). We believe the reasons behind this popularity are
mainly due to its unique advantages, which as reported by participants, include:
(i) an appropriate environment for creativity, (ii) easier to identify sounds and (iii)
perceived as more “real” (although it should be noted that Cpub is more similar to
real-world audio attenuation). These features of Caug made it provide better support
for collaborative creativity and therefore led to its popularity.


Table 8.2 Comparison between the two routes

| | 3D Annotation | Augmented Attenuation |
|---|---|---|
| Modality | Visual | Auditory |
| Type of interaction | Explicit interaction: active drawing | Implicit interaction: passive body movement |
| Supports for collaboration | Supporting communication between users | Supporting development of individual creativity |
| Characteristic | No influence introduced on audio channel, users hear roughly the same audio* | Influence introduced on audio and composition, users do not hear the same audio |
| Applications | Wider range of application, not restricted to audio tasks, audio tasks requiring precise audio output, or users with hearing/speech impairment | Application restricted to auditory tasks with no requirement for precise audio outputs |

* Strictly speaking, what users hear still slightly differs unless the realistic spatialisation of audio is
disabled


8.6 General Discussion 


The two studies have explored 3D annotation and augmented acoustic attenuation’s 
role in CMM. This section compares the two approaches against each other, seeking 
the potential differences and finding out the usage scenarios. The comparison results 
are summarised in Table 8.2. 


8.6.1 Modality and Interaction Type 


3D annotation is a visual approach, while augmented attenuation is an audio 
approach. This fundamental difference led to their unique advantages and disadvan- 
tages, which then determine their scope of usage scenarios. Specifically, the visual 
approach can fully avoid influencing the audio channel, leaving that modality purely 
for composers to hear the project they are working on. While on the contrary, the 
audio approach imposes unavoidable effects on how the audio sounds, as the privacy 
is produced by augmenting the acoustic attenuation of the medium of the sound. 

Unlike 3D annotation, which requires explicit interaction to draw 3D lines, the
augmented attenuation in Study II relies only on users' passive listening and their
active physical positioning in space. Explicit interaction means consciously deciding
to interact, e.g. clicking a button; it is what we normally think of when interacting
with a computer [55]. By contrast, implicit interaction does not require users to
perform conscious actions; the interaction consists mainly of the user's movement
(e.g. head movement and eye movement). As a result, 3D annotation introduced a
higher learning cost.


8.6.2 Key Support for Collaboration


3D annotation helps people to warm up at the beginning, supports non-vocal
communication, and helps collaborators to understand where each other's attention
lies. In other words, it supports the social aspects of collaboration by strengthening
the links between collaborators. Augmented attenuation, in contrast, gives collaborators
the choice to be separated, and hence supports individual creation. With this
flexibility, users can develop their own work and switch fluidly between working on
their own and working as a team.


8.6.3 Characteristic and Application 


3D annotation completely avoids impacts on the auditory channel. This supportive
measure suits situations where the sonic output comes with stringent requirements and
users must be able to hear exactly the same final output while working. Its application
is not limited to sonic tasks, because it supports communication, which is required by
many collaborative tasks in SVEs. In contrast, augmented attenuation has a narrower
application range. It provides better support for individual activity while still
preserving enough context of the group work, but at the cost of collaborators hearing
(slightly) different output. This makes it appropriate only for audio-related tasks
with no rigid requirements, e.g. people improvising music for fun.

These two supportive features do not necessarily contradict each other and could
be applied simultaneously. To manage their simultaneous use, a manipulation system
might be needed. For example, the transparency of the visual 3D annotation and
the degree of augmentation of the attenuation could be adjusted to modify their impacts
(visibility/audibility), fitting collaborators' needs during different stages of
collaborative composing. When only one feature is needed, the other can be adjusted to
zero, wiping out its impact entirely.
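
One way to picture such a manipulation system is a pair of normalised controls, one
per feature. The sketch below is illustrative only: the class name, the 0 to 1 ranges,
and the inverse-distance gain law with an adjustable exponent are assumptions made for
this example, not the implementation used in Study I or Study II.

```python
from dataclasses import dataclass


@dataclass
class SupportSettings:
    """Hypothetical per-session controls for the two supportive features."""
    annotation_opacity: float = 1.0  # 0.0 = annotations invisible, 1.0 = fully opaque
    attenuation_boost: float = 0.0   # 0.0 = plain attenuation, 1.0 = strongest augmentation

    def clamp(self) -> "SupportSettings":
        """Return a copy with both controls limited to the range [0, 1]."""
        def limit(x: float) -> float:
            return max(0.0, min(1.0, x))
        return SupportSettings(limit(self.annotation_opacity),
                               limit(self.attenuation_boost))


def distance_gain(distance_m: float, boost: float,
                  reference_m: float = 1.0,
                  max_extra_exponent: float = 3.0) -> float:
    """Linear gain for a source at distance_m (assumed model).

    With boost = 0 the gain follows a plain inverse-distance law.
    Raising boost increases the exponent and so steepens the roll-off,
    which is one simple way to "augment" the attenuation.
    """
    d = max(distance_m, reference_m)  # no amplification inside the reference distance
    exponent = 1.0 + boost * max_extra_exponent
    return (reference_m / d) ** exponent
```

Under these assumptions, a collaborator 4 m away is heard at a gain of 0.25 with the
boost at zero and at roughly 0.004 with the boost at its maximum, while setting
`annotation_opacity` to zero hides the annotations without touching the audio.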


8.7 Conclusions and Future Work 


In this chapter, two different approaches to supporting collaborative sonic interaction
in SVEs have been presented: one exploited the visual modality and the other the
audio modality. The results of both studies have been presented and reflected upon,
and the two approaches have been compared. Following the findings and discussion
above, we propose six implications for supporting collaborative sonic interaction in
SVEs, e.g. CMM.


1. Adding a system that supports 3D annotation may be considered to aid
collaborators' communication, especially if the co-produced sound has to be
prioritised over other modalities so that it remains free of any impacts.

2. For audio-related tasks in SVEs, adding personal space should be considered, as
it provides sonic privacy and essential support for the development of individual
creativity, which forms a key part of collaborative creativity. This is especially
important when the output of the task is vulnerable (e.g. audio) and co-workers
need a space where they can think through their own ideas and develop their own work.

3. For audio-related tasks (e.g. collaborative music making), manipulating acoustic
attenuation to create personal space can be an effective way to let users shift
continuously between personal and public working space by adjusting their relative
distance. With its lightweight form, it introduces only mild impacts compared with
the prominent negative impacts introduced by rigid personal space [34].

4. The level of privacy can be adjusted by manipulating the level of augmentation.
For instance, in Cag of Study II, participants adjusted the distance between
themselves and collaborators to achieve different levels of being personal (herein
referred to as “personalness”). Instead of changing positions, adjusting the rate at
which sound attenuates with distance can also vary the level of personalness.
Adding a method that allows users to adjust this level might therefore be useful, so
that users can shift between having a “very personal and isolated” space and a
“very public” space.

5. Augmented attenuation can be exploited to create audio privacy, which can
then be used to promote individual creativity during the collaboration. However,
augmented attenuation introduces differences in what collaborators hear, making
it applicable only to contexts with no rigid requirements on audio outputs.

6. We suggest that augmented attenuation and 3D annotation could be applied 
together or chosen with a flexible switch so that users can choose the feature 
fitting their needs during different stages of the collaborative composition. 


Future work concerns exploring how multi-modal approaches can be applied
simultaneously, and designing and applying tools based on other modalities, such as
the visual modality, to support collaborative sonic interaction in SVEs. For each
modality, it could be interesting to test how that sensory cue can be augmented or
suppressed to adjust the level of its influence.


Chapter 9
Spatial Audio Mixing in Virtual Reality


Anders Riddersholm Bargum, Oddur Ingi Kristjánsson, Péter Babó,
Rasmus Eske Waage Nielsen, Simon Rostami Mosen, and Stefania Serafin 


Abstract The development of Virtual Reality (VR) systems and multimodal simulations
presents possibilities in spatial-music mixing, be it in virtual spaces, for
ensembles and orchestral compositions, or for surround sound in film and music.
Traditionally, user interfaces for mixing music have employed the channel-strip
metaphor for controlling volume, panning and other audio effects, aspects that have
also grown into the culture of mixing music spatially. Simulated rooms and
two-dimensional panning systems are simply implemented on computer screens to
facilitate the placement of sound sources within space. In this chapter, we present
design aspects for mixing in VR, investigating existing virtual music mixing
products and creating a framework from which a virtual spatial-music mixing tool
can be implemented. Finally, the tool will be tested against a similar computer
version to examine whether the sensory benefits and palpable spatial proportions
of a VE can improve the process of mixing 3D sound.


9.1 Introduction 


Mixing is the activity of placing and levelling sounds. When a sound source in the
real world is placed spatially, its distance and direction are digitally managed in
the process of mixing and thereafter perceived through different cues. This is done
on a mixing console, the primary interface for mixing 2D music; the console includes
additional parameters to manipulate the general relationship between the sound
sources. These parameters include, amongst other things, panning, volume, and
equalisation on each track. The mixing console is divided into functional sections,
which constitute different metaphors [10]. As an example, each track has a channel
strip, used as the main way to adjust the volume. The volume control (also called a
fader) is often built on a slide potentiometer and is seen as a universal metaphor
for amplitude [3]. Similarly, the pan potentiometer represents the track's placement,
i.e. the spatial position of a source in a stereo mix: it maps the left and right
position of a knob to the left and right location of a sound. In general, the mixing
console can be seen as a metaphor in its own right, and it has brought with it
visually rich interfaces and representations of controls via graphical faders and
knobs [15]. Mixing consoles can be divided into two main categories:


• Analog mixing console: it deploys a one-to-one mapping where every slider, knob
and button has a dedicated function [10]. It is widely used and is known to be fast
and intuitive.

• Digital mixing console: it reduces the design of the analog mixer by introducing
sub-menus and layers, breaking the one-to-one mapping. It is still built on the
channel-strip metaphor but enables a smaller interface, as a user can scroll through
the tracks. It thus allows for many more possibilities but demands more effort from
the user, and it might also be harder to learn and control in a live situation [10].


Fig. 9.1 Channel strip versus stage metaphor. Hand Motion-Controlled Audio Mixing Interface [23]
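
The fader and pan mappings of a channel strip can be written down in a few lines. The
sketch below is an illustration under stated assumptions: the decibel-to-linear
conversion is standard, but the constant-power (sine/cosine) pan law is only one
common choice and is not prescribed by this chapter, and the function names are
hypothetical.

```python
import math


def fader_db_to_gain(level_db: float) -> float:
    """Convert a fader level in decibels to a linear amplitude factor."""
    return 10.0 ** (level_db / 20.0)


def constant_power_pan(pan: float) -> tuple[float, float]:
    """Map a pan position in [-1, 1] (left to right) to (left, right) gains.

    Uses a sine/cosine law, so left^2 + right^2 stays equal to 1.
    """
    pan = max(-1.0, min(1.0, pan))
    angle = (pan + 1.0) * math.pi / 4.0  # 0 (hard left) .. pi/2 (hard right)
    return math.cos(angle), math.sin(angle)


def channel_strip(sample: float, level_db: float, pan: float) -> tuple[float, float]:
    """Apply fader and pan to one mono sample, returning a stereo pair."""
    gain = fader_db_to_gain(level_db)
    left, right = constant_power_pan(pan)
    return sample * gain * left, sample * gain * right
```

With this pan law, a centred knob sends the signal to both channels at about 0.707
of its amplitude, and each control affects exactly one parameter, which is the
one-to-one mapping described for the analog console.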


The general channel-strip metaphor has furthermore been the standard way of
implementing a mixing interface in different Digital Audio Workstations (DAWs).

In contrast to a mixer based on the channel-strip metaphor, either physical or
digital, the stage metaphor/paradigm is a popular way of representing sound sources
in a stereo field. In the stage metaphor, the level and stereo position (and possibly
other parameters) are modified using the position of a movable icon on a 2D or 3D
image of a stage [7], as seen in Fig. 9.1. This metaphor was first proposed by Gibson,
who called it a ‘virtual mixer’ [12]. Even though the stage metaphor mostly uses
one-to-one mapping in terms of volume and panning, it also incorporates one-to-many
mappings: the position of each sphere can, for example, affect not only the volume
but also filtering and reverb in relation to the distance [7].

Both mixing metaphors come with drawbacks usability-wise. Firstly, when using several
tracks on the original channel-strip mixing console, it can be hard to visualise and
get an overview of the different tracks: tracks that are panned to the left will, for
example, not necessarily be controlled by the faders and pan potentiometers on the
left side of the mixer. This is easier in the stage metaphor, where each source is
graphically visualised in a 3D space. However, the stage metaphor suffers from
organisational drawbacks, as all tracks are scattered around the virtual room. With
the channel-strip metaphor, each audio channel is always located in the same place,
which makes it easy to find and control [10].
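
The one-to-many character of the stage metaphor can be illustrated with a small
mapping function. This is a minimal sketch, not the mapping of any cited system: the
coordinate ranges, the parameter names, and the numeric constants (24 dB of level
range, a 5 to 20 kHz filter range) are assumptions chosen only to make the idea
concrete.

```python
def stage_to_mix(x: float, y: float, stage_depth: float = 10.0) -> dict[str, float]:
    """Map an icon position on a 2D stage to mix parameters (assumed mapping).

    x: left/right position in [-1, 1].
    y: depth in [0, stage_depth], where 0 is the front of the stage.
    """
    x = max(-1.0, min(1.0, x))
    y = max(0.0, min(stage_depth, y))
    depth = y / stage_depth  # 0 = front of stage, 1 = back of stage
    return {
        "pan": x,                                 # one-to-one: horizontal position -> pan
        "level_db": -24.0 * depth,                # further back -> quieter
        "reverb_send": depth,                     # further back -> more reverb
        "lowpass_hz": 20000.0 - 15000.0 * depth,  # further back -> duller
    }
```

Moving a single icon backwards therefore changes three parameters at once, whereas a
channel strip would need three separate controls for the same adjustment.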

As mentioned by Gibson, humans perceive sound not only through physical sound waves
and directional cues but also by imagining sounds between two speakers [12].
‘Imaging’ works as a substitute for actually seeing the source that produces the
sound [12]. When mixing, engineers use sound pressure as a perceptual tool, but also
imagination, as it allows the engineer to create a wide range of dynamics [12]
through the likes of asymmetric panning or uneven volume relationships when visually
placing sound sources ‘between’ the speakers [12]. Here in particular, the stage
metaphor and its general visual representation serve as a helping tool and underline
the importance and usefulness of VR in this context.


9.2 Audio-Visual Interaction 


Vision is an important element in sound localisation. As stated by Yantis and
Abrams [28] and extensively discussed in Chap. 10, when conflicting information
about the location of a sound is given to the auditory and visual systems, the visual
information dominates the perception. This effect, called the ventriloquism effect,
makes the person perceive the sound as coming from the location determined by the
visual system.


There are, however, three factors that can bias the visually perceived location [24]:


• The visual and auditory events must be close in time. Ideally, the visual event
should happen before the auditory event.

• The events should be plausibly linked. In other words, the sound must be something
that could have come from the visual source.

• The visual and auditory events must be plausibly close together in space. For
example, if the sound is played through headphones but the visual source is located
behind a wall, the listener is likely to perceive the sound as coming from the
headphones and not from the visual source.


In a study by Tabry et al. [26], subjects were asked to localise a sound source by
pointing in its direction with either their head or their hand. There were two
conditions: one where they were blindfolded and one where they were not. The results
showed that the subjects were better able to localise sound on the horizontal axis
than on the vertical axis. Moreover, the subjects localised the sound more accurately
when not blindfolded. This supports Yantis and Abrams' statement that humans, in some
way, rely on both visual and auditory stimuli when it comes to sound localisation.

Furthermore, Tabry et al. stated that the results suggest a greater dependence on
visual cues for orienting one's head towards a specific location in space than for
orienting one's arm [26]. It can be argued that this supports the use of VR when
mixing spatial audio, as it gives the user visual feedback on the sound
localisation.


9.3 Designing Computer Music in VR 


While there has been a lot of investigation into designing VR for computer music,
especially within the fields of Virtual Music Instruments (VMI) and New Interfaces
for Musical Expression (NIME) [16], little focus has been placed on graphically
representing mixing, mastering and audio effects processing. With VR defined as
an immersive artificial environment experienced through technologically simulated
sensory stimuli [25], there is no doubt that this, combined with VR's inclusion of
multidimensional spaces and free rotation/movement, enables visually placing, moving
and mixing sound sources in a 3D space. Concerning VMIs, Perry Cook states that
copying an instrument is dumb, leveraging expert technique is smart [6]. This
principle is particularly relevant to interfaces in VR, as its visual qualities and
lack of physical limitations can be used as a tool and paired with external
applications such as a DAW or real-world speaker systems. The following section
focuses on the aspects that might facilitate this and outlines important guidelines
for designing a VE for sound synthesis and mixing. The most important principles
include technological considerations such as latency and cybersickness, interaction
types and possibilities, modelling sound in a physical space, and the overall
graphical representation of the system.


9.3.1 Technology 


There are multiple aspects to consider when designing virtual spaces and applications
in VR in general, such as ensuring smooth interactivity through minimal latency and
preventing cybersickness. Ideally, when modelling the real world, the latency between
moving the head or an object and seeing a new, corrected view of the scene should be
15 ms or less. However, many head-mounted displays (HMDs) only achieve a latency of
30 ms [21]. Getting as close as possible to the latency limits is important, as
synchronisation between the arrival of stimuli in different modalities is known to
influence the perceptual binding that occurs in response to an event producing
multimodal stimulation [17]. As individual senses in a virtual world are still not
represented independently, synchronised audiovisual feedback is important not only
because it serves as a response to a user's actions, but also because it creates a
bridge between an activity and its given sound [17]. In 1998, Miner and Claudel [19]
investigated the sensitivity to delay of auditory stimuli in multimedia applications.
According to their analysis, the requirements for sound-synthesis simulation of
environmental effects like reverberation, Doppler shift, and the generation of 3D
sounds are at least 66 ms. It is thus clear that high latency when manipulating sound
in a virtual 3D space will affect both the user experience and the overall perception
of sound. Latency is, as mentioned earlier, believed to increase cybersickness, which
is also affected by aspects such as display flicker and incorrect calibration. To
prevent cybersickness, especially in an environment dealing with movement and
placement of virtual objects, one-to-one mapping between virtual and real
translations/rotations is advisable, as the vestibular system in particular is
sensitive to such motion [25].
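
To relate these figures to a concrete setup, per-stage delays can be added up and
compared against a target. The sketch below is a rough illustration: the stage
breakdown (one audio buffer plus one display frame), the function names, and the use
of 15 ms and 66 ms as upper bounds are assumptions made for this example, and real
systems add further delays (tracking, rendering, transmission) that are not modelled
here.

```python
def buffer_latency_ms(buffer_frames: int, sample_rate_hz: float) -> float:
    """Time needed to fill one audio buffer, in milliseconds."""
    return 1000.0 * buffer_frames / sample_rate_hz


def frame_latency_ms(refresh_rate_hz: float) -> float:
    """Duration of one display frame, in milliseconds."""
    return 1000.0 / refresh_rate_hz


def within_budget(latencies_ms: list[float], budget_ms: float) -> bool:
    """True if the summed latency contributions stay within the budget."""
    return sum(latencies_ms) <= budget_ms


# Example: a 512-frame audio buffer at 48 kHz and a 90 Hz display.
stages = [buffer_latency_ms(512, 48000.0), frame_latency_ms(90.0)]
```

Under these assumptions the two stages add up to roughly 21.8 ms, which exceeds a
15 ms target but stays below 66 ms; halving the buffer size is one way to move the
total closer to the lower figure.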


9.3.2 Interaction 


Considering the interaction in a VR system, categories within the fields of both
user orientation and user experience have to be examined. To achieve good interaction
in computer music interfaces, the musician, computer scientist, and designer Ge Wang
suggests that the system should, amongst other things [27]:


1. Be real-time if possible.

2. Design sound and graphics in tandem and seek salient mappings.

3. Hide technology and focus on substance.

4. Introduce arbitrary constraints.


In general, this means that interaction with sound sources should be easy, quick,
streamlined and noticeable, and that virtual objects need to match the location and
motion of auditory objects. At the same time, the user should not be confronted with
technology or implementation details, so as to increase excitement and interest.
Another thing that supports the user's interest, as well as immersion and virtuosity,
is feedback. Various studies state that haptic and tactile feedback in particular
allows a user to develop musical skills and an understanding of controls [25].
Including external controls that allow for touch or vibrational feedback could thus
be beneficial. Gelineck et al. investigated this by comparing the stage metaphor (an
iPad app visualising a stage) to the channel-strip metaphor (normal faders and
panning) when completing a stereo mix [11]. While they concluded that there was no
significant difference in terms of performance, the iPad application was preferred
in terms of user experience for its intuitiveness, enjoyability and ability to reveal
the spaciousness of the mix [11]. They do, however, point out a side effect of
representing the mix visually: it might take focus away from listening. It is thus
important to find the right balance between graphical representation and
haptic/tactile feedback, in order to keep the focus on the main aspects: mixing and
listening to sound.

Using the strengths of VR will possibly improve this interaction. As mentioned by
Serafin et al., virtual reality is believed to show the greatest potential when
facilitating experiences that cannot be encountered in the real world [25]. This
leads to the principle of considering natural and magical interaction in the system.
The principle suggests that combining natural interaction (normal feedback to
real-world movements) with magical interaction (interaction that is not limited by
real-world constraints, such as flying and teleporting) will open up new and
non-traditional interaction possibilities for already realised interfaces [25].


9.3.3 Sound in Space 


Looking at the space in which the sound will be virtually presented, VR offers
several possibilities, as introduced in Chaps. 1 and 6. In different physical spaces,
sound is shaped by the room's spectral characteristics and modified by room
properties such as size, material, and shape. Different methods can be chosen when
applying spatialisation models to the way a virtual room shapes the sound. Robert
Hamilton distinguishes between two main models: the user-centric perspective and the
space-centric perspective [14]. In the user-centric perspective, the sound is
manipulated from a first-person point of view, where sounds in the virtual world
correspond to a real-world-based model of hearing: they are placed in a general aural
spectrum known from everyday life, with corresponding depth cues implemented through
filtering and delay. This can be done by tracking the coordinate distance between
event locations and the user's in-game avatar [14]. The space-centric perspective, on
the other hand, shifts the focus to the sound itself, correlated between the virtual
and physical world. In this model, sounds are no longer contextualised based on their
proximity and relationship to a given user [14]. Instead, they are processed in
relation to both the virtual and the physical world, meaning that the placement in
each environment (for example, a spatialised speaker system) will affect them. This
allows for multiple users and a communal experience [14]. In relation to the
user-centric perspective, Gödde et al. [13], with a focus on cinematic narration in
VR, describe two possible ‘user-centric’ roles: a passive role, where the viewer is
only an observer with no connection to the scene [13], in which the experience is
more laid back and requires lower involvement, resulting in a focus on narration and
the environment; and an active role, where the viewer is part of the scene [13], in
which the experience is involving, resulting in a higher potential for presence that
might, however, take focus away from narration and environment [13].
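
The user-centric idea of deriving depth cues from the coordinate distance between a
sound event and the listener can be sketched as follows. This is an illustration
under assumptions rather than Hamilton's implementation: the function name, the
inverse-distance gain, and the linear low-pass mapping (500 Hz less cutoff per metre,
floored at 2 kHz) are invented for the example; only the propagation delay follows
directly from the speed of sound.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 degrees C


def user_centric_cues(source: tuple[float, float, float],
                      listener: tuple[float, float, float]) -> dict[str, float]:
    """Derive simple depth cues from the source-to-listener distance."""
    distance = math.dist(source, listener)                  # Euclidean distance in metres
    delay_ms = 1000.0 * distance / SPEED_OF_SOUND_M_S       # propagation delay
    gain = 1.0 / max(distance, 1.0)                         # assumed inverse-distance law
    cutoff_hz = max(2000.0, 20000.0 - 500.0 * distance)     # assumed low-pass mapping
    return {
        "distance_m": distance,
        "delay_ms": delay_ms,
        "gain": gain,
        "cutoff_hz": cutoff_hz,
    }
```

For a source 5 m from the avatar this yields a delay of about 14.6 ms, a gain of 0.2
and a cutoff of 17.5 kHz. A space-centric system would instead compute such values
relative to fixed positions in the room or speaker layout rather than to one
listener.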

Besides handling the geometrical aspects of a virtual room, such as a sound
source's spectral position and distance from the listener, it is also necessary to
include a simulation of the acoustics of the given room. This will ensure a VR
application that realistically represents the perception of sound. As stated by Falch
et al., incorporating room simulation in binaural sound reproduction systems is
important for improving localisation capabilities as well as out-of-head
localisation [30], which clearly indicates the importance of acoustics when
replicating binaural sound.


9.3.4 Graphical Interface


In relation to the graphical and visual representation of objects and environment in 
a 3D world, Wang proposes four aesthetic principles [27]: 


1. Simplify: Identify core elements, trim the rest 

2. Animate, create smoothness, imply motion: it is not just about how things look, 
but how they move 

3. Be whimsical, organic: glow, flow, pulsate, breathe: imbue visual elements with 
personality 

4. Aesthetic: have one; never be satisfied with ‘functional’. 


Since it is known that spatial audio approaches tend to facilitate interaction that
is intuitive and familiar, the above principles are important as they can further
enhance this. In particular, simple and organic elements, as well as smoothly animated
object motion, will improve the user experience. This can, for example, be achieved
by adding shader programs, which aim to make virtual objects resemble their real
counterparts in shape, behaviour and appearance [18]. Shader programs are mainly used
for adjusting a scene's illumination, post-processing or special effects [18]. The
two best-known shader types are vertex shaders, which transform vertices and texture
coordinates from object space to window space, and fragment shaders (also called
pixel shaders), which determine how the pixels between the vertices look [22].

Besides the principles of Wang, Gale et al. furthermore suggest that one should
especially avoid visual clutter, meaning too many objects potentially overlapping
and/or occluding each other on the screen [29]. In relation to object clutter and
general control, Serafin et al. state that the experience can additionally be
enhanced by visually representing the player's body [25]. People cannot see their own
body in VR, which can be overcome by generating a visual substitute for a person's
real body seen from a first-person perspective. This creates a visual illusion and
results in ‘virtual body ownership’, which gives users the presence that successful
feedback requires [25]. However, different visual representations of the body create
different interaction expectations. A realistic representation of hands has, for
example, been shown to raise expectations of a more natural interaction experience
than the given system allowed [2]. The appearance of the virtual representation and
the expectations it produces are thus important to consider.


9.4 Existing Mixing Interfaces 


Different programs that have been created and used for mixing audio for 3D will
be examined in this section. The focus will be on the implementation, design, and
usability of the systems. Furthermore, features and standards of the existing programs
will be examined in order to find inspiration and reach a state-of-the-art level for
this project's product.

‘Auro Technologies’¹ is a company that aims to create the next-generation audio
standard by becoming the leader in state-of-the-art sound. They offer a product that
can be used in the game and film industries as well as in the mobile and automotive
industries. This is made possible with AAX plugins, which allow the user to mix
for an 11.1 system, where the approach is to treat different elevation angles as
layers: ‘lower’, ‘height’, and ‘top’. Through algorithms, the audio is backward
compatible with 5.1 and 7.1 systems.

The plugin offers an overview of where each speaker is located in a 3D space as 
well as displaying modifiable parameters, such as depth of reverb, bass and treble 
equaliser, and volume of the sound. 

While mixing, each individual audio track in the session has a relevant plugin
inserted. These plugins include ‘Auro-Panner’, ‘Auro Bus’, ‘Auro-Mixing Engine’,
and the ‘Auro-return’. The Auro-3D system thus comprises several plugins and
furthermore requires a processor called ‘A3DHost’ to be running in the background
while working with Auro plugins, as well as ‘Auro-Dmix Control’ in order to down-mix
the bounce to a specific format.

Objectively, the approach of Auro-3D can be problematic, as it may require a
significant number of plugins to be running at the same time in a large project, which
will affect the processing power. A powerful computer is therefore needed for it to
be used with low latency. At the same time, one might argue that it is troublesome
and counter-intuitive to individually place several plugins on each audio channel. An
additional downside to this product is that it only works with Pro Tools, and only on
Mac computers, which can eliminate a large number of potential users.

However, the product is still used heavily in the industry and has won multiple
awards.² Its design in particular is important to keep in mind for the product of
this project. Even though it requires multiple plugins on each individual audio
channel, the plugins have a clear user interface that highlights affordances and uses
signifiers and feedback to give the user an understanding of what can be adjusted and
modified. An example is the effects controls. Firstly, they are designed to look like
knobs with labels above them to signify what each knob controls.³ Additionally, there
is feedback in the middle of the knobs to show which value they are set to, which is
furthermore highlighted with lights around them to show where on the rotation axis
they are set. This light also visualises the range in which the knobs work and their
boundaries in both directions. Sliders control the volume of both the sound source
and the amount of reverb, which is a common way to control volume when working with
audio. Even though the interface is not in 3D, it can be seen that Wang's aesthetic
principles (Sect. 9.3.4) are relevant. Only essential settings for the volume and
reverb are visible and modifiable (simplification), the sliders have value feedback
and the knobs have both value and light feedback (animation), which gives the
interface and its controls a simple design that is easy to get an overview of.


1 https://www.auro-3d.com.
2 https://www.auro-3d.com/about-us/mission/.
3 Knobs are commonly used on audio-related products and mixers.

Another product, although not commercialised, was made by Wakefield and Gale [9].
Their product was created as part of research on how to solve perceptual problems in
the 3D stage paradigm/metaphor when it comes to mixing audio. When more audio tracks
are added, the visualisation can soon become cluttered, which limits depth perception
and makes interaction difficult [9]. Furthermore, they wanted to minimise the risk of
‘gorilla arm’, a term used when users keep their arm elevated for a long period of
time [5, 9]. Wakefield and Gale created an environment in VR for mixing audio. The
system allows multiple options for adjusting each audio signal: send-effects control,
filters, equaliser, volume, and pan parameters. All of this is controlled with one
controller. According to their studies, the VR mixing interface may have helped with
depth perception of the audio. However, it did not improve clutter and object
occlusion [9]. It may be reasonable to think that the UI of the program has affected
those problems. Only one audio track is in the environment, but the effect controls
fill almost the whole screen. So, even though the necessary parameters are present
and participants did not mention them being problematic, their display and
arrangement might be something to keep in mind when designing the UI for the product
of this project. With Wang's aesthetic principles in mind, it is clear that the
interface is neither simplified nor aesthetic.

Dear Reality is a German company which specialises in creating ‘ultimate tools
for immersive 3D audio production’.⁴ They offer multiple products under the name
‘dearVR’ for game engines, controllers, and DAWs. The ‘dearVR Pro’ product offers
full 360° manipulation of sound with built-in acoustics and reflections controls 


9.5 Target Group 


Since this project aims to develop an aid for mixing spatial audio in VR, the user is
expected to have previous experience in mixing audio, but not necessarily spatially.
This will allow the user to be aware of the possibilities that a product facilitating
spatial mixing offers (panning, volume change in depth, filtering as a result of
elevation), but still explore the product as an entity. The product can thus be
targeted at different groups, ranging from game developers wanting to quickly sketch
an audio-based atmosphere for their in-game environment, to music composers mixing
spatialised audio for surround sound or VR applications, and experimental musicians
wanting to explore the use of 3D sound.

Since VR offers sensory feedback and spatial proportions that differ from those of a
desktop application, and since it is known that programs using the stage metaphor are
intuitive (see Sect. 9.3.2), everyday use of the end product could target composers
and producers who might need a quick and easy assisting tool for the audio
spatialisation process, be it music for film, sound design or audio for games. With
this scenario in mind, the target group therefore covers hobby producers,
semi-professional producers, and professional mixing engineers and composers, as long
as they are familiar with mixing.

To further understand the needs of a producer or composer mixing music for spatial
media, an expert interview was conducted. An extensive questionnaire was sent to
audio engineer Gestur Sveinsson, from the recording studio ‘Studio Syrland’ in
Reykjavik, in order to get his opinions on necessities when mixing audio spatially.
Having worked with surround sound for both cinema and music, Sveinsson notes the
importance of having touchscreen mixing tools that allow him to quickly and
intuitively translate his ideas into reality. He adds that a visualisation tool for
audio placement would indeed make sense as long as it is based on the idea of analog
faders and panning knobs. In relation to his personal workflow, Sveinsson states that
he sees the angles of the sound sources on a screen and arranges the mix without
having to turn his head towards the angle that a given sound is coming from.
Nonetheless, he thinks that a face tracking system would be useful and that,
especially from a consumer point of view, it could make the experience
‘hyper-realistic’. As for his personal preferences, he rates aesthetics and design as
well as intuitiveness, and thus the time it takes to do a mixing task, as very
important aspects of a mixing device, whereas precision comes second.


4 https://www.dearvr.com.

With this knowledge in mind, the virtual mixing tool will be implemented
using the research above to target composers and producers who can use
it quickly, intuitively and precisely.


9.6 Conceptual Overview 


The user is placed in a 3D VR environment where different tracks from Ableton Live
are represented as labelled spherical sound sources in space. A ray is
cast from the user’s controllers to signify which sound source one interacts with. If
the ray is positioned on a sound source, the controller will respond with vibration to
signify contact between the ray and a sound source. After selecting a sound source,
the user can move it in space. Data on position, distance, and angles will be
passed into Max by Cycling ’74, where the auditory placement in space will happen.
As a result, the visual and auditory locations of the objects
will match, giving an audiovisual experience in space and creating a tool for
users to visually place and mix different sound sources spatially.
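The position data described above can be illustrated with a minimal sketch. It assumes a Unity-style left-handed, y-up coordinate frame and a listener whose head rotation is reduced to a yaw angle; the function name `source_polar` and its parameters are hypothetical and not taken from the project's scripts.

```python
import math

def source_polar(listener_pos, listener_yaw_deg, source_pos):
    """Return (azimuth_deg, elevation_deg, distance) of a source seen from the listener.

    Assumed frame: x points right, y up, z forward (as in Unity).
    Azimuth is 0 straight ahead and positive to the right, wrapped to [-180, 180).
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dz = source_pos[2] - listener_pos[2]

    distance = math.sqrt(dx * dx + dy * dy + dz * dz)

    # World-space bearing of the source, then made relative to the head yaw.
    azimuth = math.degrees(math.atan2(dx, dz)) - listener_yaw_deg
    azimuth = (azimuth + 180.0) % 360.0 - 180.0

    # Elevation above the horizontal plane through the listener's head.
    elevation = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    return azimuth, elevation, distance
```

For example, a source one unit to the right of a listener facing forward yields an azimuth of 90° and a distance of 1; if the listener turns 90° to the right, the same source ends up straight ahead.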


9.7 Virtual Environment 


To keep the centre of attention on the sound sources’ spatial proportions, the
environmental setting will consist of very few props: a stage and virtual objects linked
to audio tracks. It has been decided to design and model a stage-like environment
as the main aspect of the surroundings, as this might elicit the stage metaphor and
the virtual mixer as explained by Gibson in the analysis. This will allow the user to
manually place objects in space with respect to a realistic and intuitive behaviour of
the sound sources in the scene, like changing size according to distance or lighting up
when pressed. Design-wise, it has been decided to develop an environment where the
user is centralised in the scene on a slightly elevated cylindrical surface, to emphasise
their role as a mixer/positional conductor. A circular truss will be positioned above
the user (Sect. 9.7.1) to further highlight the ‘stage’ look. An illustration of the
initial design of the stage, the truss and the sound sources depicted as spheres is shown
in Fig. 9.2.

Fig. 9.2 Sketch of the design in its initial stage

To focus on the ‘substance’ principle, the scene will consist of: (i) A simple
yet aesthetic environment to focus on the importance of the mixing task. (ii) The
different objects’ relations to the localisation of sound. (iii) Spheres pictured around
a stage, to use the benefits of the stage metaphor like ‘intuitiveness, enjoyability
and its ability to reveal the spaciousness of the mix’, as earlier stated by Gelineck
et al. (Sect. 9.3.2). To keep an aesthetically pleasing, yet simple design that guides
the user’s attention to the mixing task rather than the visuals, it has been decided
to avoid scenes with myriad different elements such as concert halls and theatre
stages, as this might take focus away from listening. To further enhance the feeling
of spaciousness within the environment, it has been decided to use a ‘grid-like’
structure on walls, floors and ceilings. This is, as seen in the state-of-the-art section
(Sect. 9.4), a widely used technique to display and give the user a sense of dimensions
and will automatically create spatial constraints as it outlines the boundaries of
the room, giving possible distance limits in relation to sound source placement.
These constraints are furthermore supported by the truss, which represents the outer
boundary of object placement: the user cannot place sound sources outside of the
truss area.

An aspect that is widely used, especially within cinematic VR (also called
360° videos), is the use of cues to guide the user’s attention, as the user can freely
rotate their head and thus choose the field-of-view (FoV) [20]. Whereas these cues
normally focus on storytelling and narration, cues like implicit diegetic cues are
also applicable within this environment. Implicit diegetic cues are factors like
objects or props within the scene that implicitly guide the user to do something [20].
This is, as an example, seen in the truss acting as an environmental constraint as
mentioned above: the truss is a barrier signifying the limitation of object placement and
thus forcing the user to re-orientate. The sound sources themselves are furthermore
important implicit diegetic cues as they give the user a sense of their placement and
space when the user has to redirect attention. The sound sources alternate between being
on-screen and off-screen diegetic cues according to when they are present within the
FoV [4]. As the spheres serve as spatial guidance, their look, sound, and feel are all
of high importance when it comes to attention leading and general orientation within
the environment. In relation to this, it has, as an example, been chosen to make the
different sound sources light up when hovered over by the user, as this gives the user
a better overview of the mix, and to avoid confusion by changing colour when
something is ‘soloed’ (the act of isolating a sound).

As mentioned, these cues simply affect the FoV and thus also the ‘user-centric 
perspective’ that manipulates the sound from a first-person point of view and con- 
textualises it to the position of the user. The placement of the camera, which also 
serves as the perspective of the user, therefore has to match the general viewing posi- 
tion. This will be done in Unity using the camera as the viewpoint. For the vertices 
and fragments of objects and shaders, the ‘user-centric perspective’ will be handled 
using the ‘object to clip’ node, which transforms a position in object/local space to 
the camera’s clip space. 


9.7.1 Rendering and Lighting 


The ‘Lightweight Rendering Pipeline’ (LWRP) in Unity will be applied to render
the scene and its light. Since the Oculus Quest relies on its own built-in hardware, the
LWRP will be optimal as it targets a broad range of mobile platforms, VR and games
with limited real-time light capabilities.5 By making a few trade-offs in relation
to lighting and shadows, such as fewer draw calls, the LWRP optimises the real-time
performance of the system, thus allowing for uncomplicated real-time processing and
salient functional mappings, which were mentioned as important design requirements.

In relation to the lighting within the scene, general directional lighting is used 
to illuminate the environment. The lighting was chosen to be coloured to add to the 
atmosphere within the scene. Coloured lights were simultaneously used as decoration 
within the scene, where bars in red and blue represent LED strips. The emission of 
white rings on the walls and in the surface additionally adds light to the scene and 
through global illumination, surface reflections were simulated. To enhance the ‘stage 
aesthetics’ even further, coloured fog was included, as fog is usually experienced 
within concert experiences. 


5 https://docs.unity3d.com/Packages/com.unity.render-pipelines.lightweight@4.0/manual/index.html.

9.7.2 Interaction


The VR system contains three main interaction types and their respective feedback, 
including visual and auditory feedback, which is a combination that reinforces a 
user’s given action, meaning that the user both sees and hears the results of the actions 
made. The different interactions and their feedback, as pictured in the conceptual 
overview above can be explained as: 


• Touch/haptic interaction: the selection and manipulation of the different sound
objects will constitute the haptic interaction. Here the user is allowed to touch/select
the different objects by pressing a button, and to move them by moving their arm.

• Visual feedback: an object will light up when it is clicked/selected,
and move corresponding to the user’s force and arm movement. The visuals are
thus designed with a focus on natural mapping as the sound sources, with respect
to their auditory perception, are placed where they would be placed in a real-world
situation.

• Auditory feedback: the panning and the volume of the chosen sound source will
change according to the placement of the object both on the horizontal axis
(azimuth) and in depth (distance).

• Tactile feedback: vibration will happen when the user hovers over an object to
signify that it can be selected. This is to help the user find and aim at the
desired object.


To sum up, it can be said that the visual feedback of the haptic interaction facilitates
the stage metaphor and virtual mixer analogy, as sound sources are positioned in space
relative to the user, whereas the auditory feedback facilitates the binaural synthesis
and combines the interplay of visual and auditory cues used in human perception. The
tactile feedback, on the other hand, constitutes increased usability of the product and
the potential of the user to, within the environment, gain skills and understanding
of the different controls. Additionally, it acts as a substitute for the mixing console,
which as earlier mentioned also is a tangible controller. How the user scrolls through
the audio tracks, and visually as well as auditorily pans and levels them, is now an
integrated part of the VE, rather than the mixing console.


9.7.3 Shaders and Visual Appearance


In Fig. 9.4, the visual appearance of the final environment is shown. This design was
reached from aesthetic and stylistic ideas drawn from the different scenarios seen in
the mood-board (Fig. 9.3). Inspired by the ‘stage metaphor’, spheres were used as sound
sources, instead of objects picturing the actual instrument/object the sound
is coming from. This was done to avoid unrealistic representations of the sound
sources, which potentially could create user aversion and additionally introduce
latency problems for the Oculus Quest. The spheres were furthermore chosen as
the main audio representation as they could lend an abstract feeling to the very
artistic subject that music and mixing is, as well as being used in products such as
dearVR.

Fig. 9.3 Inspirational mood-board for visual appearance and colours

As seen in the mood-board, the colours blue and red, as well as the effect of lasers 
serve as a big inspiration for the look of both the environment and the shaders. 


9.7.4 Audio Design 


For the audio, head-related impulse responses (HRIRs) from MIT were used.6 The pack
includes IRs ranging from −40° to +90° on the vertical axis, where each elevation has
its own set of IRs for the azimuth (5 degrees between each IR). Each IR was measured at a
distance of 1.4 m. As this pack consists of 710 different IRs, the computation would
be both heavy and complicated and, therefore, it was decided to evaluate whether it
was needed to implement the IRs for elevation, as humans have perceptual difficulties
placing audio on the vertical axis.


9.7.4.1 Can We Remove Auditory Elevation Cues? 


A total of 14 participants were gathered for the evaluation, which was set up at
Aalborg University in Copenhagen. The participants were informed of the research
question ‘Do you feel like the sound is matching the position of the object?’ before
the test started and asked to answer either ‘yes’ or ‘no’, with the option to hear the
sound again, if needed. Additionally, they were encouraged to focus on a sphere
centred in the middle of the screen, but they were allowed to look around.

6 https://sound.media.mit.edu/resources/KEMAR.html.

The threshold of accuracy was set to 80% for scenes with no elevation, and 50% for 
scenes with elevation. There was no audio manipulation (volume change, filtering) 
for the elevation. The results showed that 91.4% of the participants felt that the audio 
matched the position of the object with 0° elevation and 95.2% felt the audio matched 
the object’s position when it was elevated. 

The results from this evaluation show that having visual cues for the audio source
made the audience interpret the sound as originating from the visible object. Even
though no audio manipulation took place regarding elevation, an overwhelming majority
perceived the audio to be elevated. Therefore, it was decided not to implement the
HRIRs for the elevation and instead, sound design-wise, to rely only on azimuth and
distance cues.
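With elevation dropped, choosing an HRIR reduces to picking the measured azimuth closest to the direction of the source. The following is a minimal sketch of that lookup, assuming IRs measured every 5° over the full circle at 0° elevation; the function name `nearest_hrir_azimuth` is hypothetical and the actual selection logic in the Max patch is not documented here.

```python
def nearest_hrir_azimuth(azimuth_deg, step_deg=5):
    """Return the measured azimuth (0, 5, ..., 355) closest to the requested angle.

    Assumes a horizontal ring of impulse responses spaced step_deg apart.
    """
    count = 360 // step_deg
    wrapped = azimuth_deg % 360.0          # map negative angles into [0, 360)
    index = int(round(wrapped / step_deg)) % count
    return index * step_deg
```

A source at 12° would thus use the IR measured at 10°, and a source at −3° the one measured at 355°.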

Even though no audio manipulation was implemented for this particular test, it 
is important to implement acoustics manipulation relevant to the environment. This 
relates both to volume change based on distance (Inverse-square law) and rever- 
beration based on the dimensions of the room. Both reverb and low-pass filtering 
will, therefore, be added to change timbral properties and give the sound an appli- 
cable aspect of room acoustics. The properties will be applied based on subjective 
sound-aesthetics and not by using physical models. 


9.7.5 Summarising Design 


In conclusion, to keep the attention of the user focused on the sound sources’ locations,
the VR environment will consist of very few props: a stage and virtual sound sources.
The stage metaphor, as well as a grid shader for walls and floor, is used to reveal the
spaciousness of the mix as well as of the VR environment. Other shaders, such as a
fresnel effect, are used to signify user action as well as provide aesthetic value. Using
theory from cinematic VR, sound sources can be considered on- and off-screen diegetic
cues, guiding the user’s attention, while the view facilitates an active user-centric
perspective. To optimise the performance of the system the rendering pipeline ‘LWRP’ is
used, while it was also decided to keep the lighting of the scene relatively simple. A
combination of haptic, visual, auditory and tactile feedback is used to enhance usability.
Spheres were chosen to represent sound sources to constitute an abstract aspect of audio
mixing and imaging as explained by Gibson. Based on the conducted evaluation of
perception of elevation, showing that an overwhelming majority perceived auditory
elevation based on visual feedback only, it was decided to only implement HRIRs
for azimuth. Simple acoustics manipulation will furthermore be applied to simulate
distance of sound sources. An illustration of the final environment, including colours,
shaders and lighting can be seen in Fig. 9.4.

Fig. 9.4 Final design of the environment


9.8 Implementation 


The implementation of the interactive VR environment and its inclusion of dynamic 
binaural synthesis consists of different steps and programs: 


1. Firstly, a combination of the Oculus Quest system and the game engine Unity
will be used to create a 3D environment that allows the user to manipulate and
position objects within a virtual space. Support for VR in Unity will be imported
through asset store items; in this case, Oculus Integration is used.7

2. Secondly, object coordinates, angles and user head rotation will be implemented
and retrieved based on different scripts. This will be sent through Open Sound
Control (OSC) to Max via a User Datagram Protocol (UDP) connection. An additional
Unity asset store item called ‘OSC simpl’ is used here.8

3. Finally, Max and its live integration with Ableton Live will execute real-time 
sound rendering and binaural synthesis, through a convolution process of differ- 
ent HRIRs related to the respective sound object angles. 

4. The communication between Unity and Max will furthermore be emphasised, as 
the track/audio names from Ableton Live will be displayed as part of the sound 
sources in the VR environment. This will additionally be implemented through 
OSC communication. 


UDP is a connectionless communication protocol used across the Internet, especially
for time-sensitive transmissions, and is considered a quick communication
protocol, as it allows data transfers before the receiving party agrees to the
communication.9 OSC is, likewise, a protocol designed especially for networking sound
synthesisers and computers. It uses UDP to transfer data within local subnets and is
thus an obvious choice for communication over UDP. An overview of the different stages,
systems and software used can be seen in Fig. 9.5.

7 https://assetstore.unity.com/packages/tools/integration/oculus-integration-82022.
8 https://assetstore.unity.com/packages/tools/input-management/osc-simpl-53710.

Fig. 9.5 Illustration of the system

The process of representing real-time audio spatialisation is done using a range of 
scripts developed to ultimately pass the necessary information from Unity to Max to 
create a spatialised mix. 
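To make the Unity-to-Max link more concrete, the sketch below builds an OSC 1.0 message by hand and sends it as a single UDP datagram. It is only an illustration in Python; the project itself uses the ‘OSC simpl’ Unity asset. The address pattern `/track/<n>/position`, the host, the port and the function names are assumptions made for this example.

```python
import socket
import struct

def _osc_string(text):
    """Encode an OSC string: ASCII, null-terminated, padded to a multiple of 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)

def osc_message(address, *floats):
    """Build an OSC 1.0 message whose arguments are all 32-bit big-endian floats."""
    type_tags = "," + "f" * len(floats)
    payload = b"".join(struct.pack(">f", value) for value in floats)
    return _osc_string(address) + _osc_string(type_tags) + payload

def send_source_state(sock, host, port, track, azimuth_deg, distance):
    """Send azimuth and distance of one track as a single UDP datagram.

    The address pattern is hypothetical; the receiving patch would have to
    listen for the same pattern on the chosen port.
    """
    message = osc_message(f"/track/{track}/position", azimuth_deg, distance)
    sock.sendto(message, (host, port))

# Usage (assumed host and port):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   send_source_state(sock, "127.0.0.1", 9000, 1, 30.0, 2.5)
```

Because UDP is connectionless, `sendto` returns without waiting for the receiver, which suits the frequent, small position updates generated while an object is being dragged.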


Convolution 


The convolution of the incoming signal and the signal of the different HRIRs is
done in the frequency domain using the pfft~ object. The pfft~ object
essentially is a processing manager that splits the FFT process into smaller tasks, each
taking care of their own FFT process.
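The principle behind frequency-domain convolution can be sketched outside Max. The example below is a plain-Python illustration, not a reproduction of the pfft~ patch: it zero-pads both signals, multiplies their spectra and transforms the product back, which gives the same result as direct time-domain convolution. All function names are hypothetical.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_convolve(signal, impulse_response):
    """Linear convolution computed by multiplying zero-padded spectra."""
    out_len = len(signal) + len(impulse_response) - 1
    size = 1
    while size < out_len:
        size *= 2
    a = fft(list(signal) + [0.0] * (size - len(signal)))
    b = fft(list(impulse_response) + [0.0] * (size - len(impulse_response)))
    product = [p * q for p, q in zip(a, b)]
    result = fft(product, inverse=True)
    # The inverse transform above is unnormalised, so divide by the FFT size.
    return [value.real / size for value in result[:out_len]]

def direct_convolve(signal, impulse_response):
    """Reference time-domain convolution, used here only for comparison."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binaural(signal, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair."""
    return fft_convolve(signal, hrir_left), fft_convolve(signal, hrir_right)
```

A real-time implementation would process the audio block by block with overlap handling rather than transforming a whole signal at once; that part is left out of this sketch.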


Distance Simulation 


While it was confirmed earlier that humans have a hard time distinguishing between
elevations of sound, especially when the sound is accompanied by a visual object, the
simulation of distance is easy to perceive and important both in relation to the display
and localisation of sounds. In this project, it has been decided to use the inverse-square
law to simulate distance. In relation to the difference of sound in each ear, due to the
acoustic shadow of the head, this project only takes the interaural time difference
(ITD) and sound intensity in space into consideration. The frequency dissimilarity
and the ITD at longer distances (the ITD is covered through the HRIRs at closer
distances) are thus not considered. This has been decided due to the fact that spectral
cues at shorter distances (10 m) are insignificant and that sound has to travel more
than 100 m for frequencies around 4 kHz to be attenuated 7 dB [1].
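Under the inverse-square law, intensity falls with the square of the distance, so the amplitude gain applied to a signal falls with the distance itself (roughly 6 dB per doubling). A minimal sketch follows; the reference distance, the minimum distance clamp and the function names are assumptions for illustration and are not values taken from the Max patch.

```python
import math

def distance_gain(distance, reference_distance=1.0, min_distance=0.1):
    """Amplitude gain for a source at the given distance.

    Intensity is proportional to 1 / distance**2, hence amplitude to 1 / distance.
    The clamp avoids unbounded gain when a source is dragged onto the listener.
    """
    clamped = max(distance, min_distance)
    return reference_distance / clamped

def distance_gain_db(distance, reference_distance=1.0, min_distance=0.1):
    """The same gain expressed in decibels."""
    return 20.0 * math.log10(distance_gain(distance, reference_distance, min_distance))
```

Doubling the distance from 1 to 2 units halves the amplitude, which corresponds to a level change of about −6 dB.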


9 https://www.howtogeek.com/190014/htg-explains-what-is-the-difference-between-tcp-and-udp/.

The Max patch created is added to each audio track in Ableton as an audio effect.
The azimuth and distance for left and right ear are shown to the user. Additional 
information is visualised for the user such as the wet amount of the reverb and the 
cutoff of the filter used. From a drop-down menu, the user can choose which port 
this track should listen to. This ensures that the same patch can be used for all the 
tracks and only the port has to be changed. Additionally, the IRs are read in and when 
loaded, they are automatically applied to all active patches in the Ableton project. 


9.9 Creating the Computer Version 


As earlier mentioned, a computer version was created for comparison purposes with 
the VR version. It has been decided to compare the VR version to its ‘computer
screen’ counterpart, as this will put the possible advantages of the VR version in
focus and thus not be influenced by aspects such as look or controls. The computer 
version of the final product was created to be as similar to the VR version as possible, 
however, there are some key differences when it comes to how the interaction is 
carried out. A ray is cast to wherever the mouse cursor is placed in the scene. When 
holding right-click, camera rotation is enabled, and the camera rotates based on 
mouse movement to imitate the camera interaction of the VR version. The ‘track’ 
object is ‘grabbed’ by hovering the mouse cursor over an object and holding left- 
click, whereafter it is possible to move and place the grabbed object by using the 
keyboard keys ‘W-A-S-D’. A combination of left- and right-click makes the grabbed 
object follow the camera rotation. 

The environment in the scene uses the same exact objects, coordinates and other 
visual aspects to eliminate any bias against or towards any of the two versions in this 
aspect. However, there are some small differences, because of different hardware, 
like display colours and refresh rate. 


9.10 Evaluation 


The following evaluation presents the setting, procedure and results of both the final 
focus group interview and mixing task evaluation. The mixing task evaluation uses a 
t-test to investigate the relationship of precision and time used between a VR version 
of the program and computer screen version. This is done to examine whether or not 
the benefits of VR and its sensory inclusion, can be seen as an overall improvement 
when mixing spatial audio. The computer screen version thus acts as the control 
condition representing similar interaction, affordances and sensory stimuli that can 
be found when placing audio spatially on a computer screen. However, it is important 
to note, that it is not a specific resemblance of already existing products. 
For the t-tests, the following two hypotheses are further evaluated:


• H0: The mixes made by the participants in the VR version show no difference in, or
less, relative precision with respect to the reference mix than the mixes made in the
computer version.

• H1: The participants in the VR version used less time on recreating the reference
mix, compared to the computer version.
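Since every participant tries both versions, one plausible reading is that the t-tests are paired (dependent-samples) tests; the chapter does not state the exact variant, so this is an assumption. The sketch below computes the paired t statistic and its degrees of freedom; the function name and any numbers fed into it are purely illustrative and are not the study's data.

```python
import math
from statistics import mean, stdev

def paired_t(condition_a, condition_b):
    """Return (t, degrees_of_freedom) for a paired t-test.

    Each index holds the two measurements of the same participant,
    e.g. completion time in VR and on screen.
    """
    if len(condition_a) != len(condition_b):
        raise ValueError("both conditions need one value per participant")
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1
```

A negative t for time (VR minus screen) would point in the direction of H1; whether it is significant depends on comparing t with the critical value of the t-distribution for the given degrees of freedom.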


9.11 Setting and Procedure 


Both the focus group interview and part of the mixing task evaluation took place at 
Aalborg University Copenhagen. The other part of the mixing test took place in an 
apartment in Nørrebro due to convenience of test participants. Twelve students and 
a professor from the ‘Music Production Bachelor’ of the ‘Rhythmic Conservatory 
of Copenhagen’ attended the focus group interview, whereas 24 participants with 
different musical backgrounds and mixing experience, took part in the mixing task 
evaluation. 

The focus group interview was carried out after the ‘Danish Sound Network’ event
‘Behind The Scenes’ on the 2nd of December 2019, where each participant had tried 
the VR mixing program for at least 5 min, including instructions. During the trial, 
the participants were instructed to try out and notice different aspects such as free 
rotation, elevation, auditory feedback and visual appearance. After the possibility of 
giving individual oral feedback, a focus group interview was conducted. Here each 
participant had the possibility of discussing and evaluating different topics selected 
by the conductor, with each other. The focus group interview was carried out in 
informal surroundings and lasted for 55 min. 

The mixing task evaluation was carried out over three days, from the 9th to 11th
of December 2019. The evaluation took place in a separate room, where each
participant tried both the VR version and the computer screen version of the program,
in randomised order to avoid an experience bias. Both time and precision for each
participant were computed. It is important to note that the precision of the participant
mix in relation to the reference mix is measured in units (in Unity called metres).
Since the amplitude of sound normally decays logarithmically, the distance between
sound sources close to the participant is of higher significance than that between sound
sources farther away. Half a unit is, as an example, visually experienced as a bigger
distance near the participant than for objects in the distance. The measure of precision
should thus be seen as a relative unit of measurement and not a counterpart to objects in
the real world.
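The chapter does not give the exact formula for the precision score. One simple reading, used here only as an assumption, is the sum over all tracks of the Euclidean distance (in Unity units) between where the participant placed a source and where it sits in the reference mix. The function name and track names below are hypothetical.

```python
import math

def mix_error(participant_positions, reference_positions):
    """Sum of per-track Euclidean distances to the reference mix, in Unity units.

    Both arguments map a track name to an (x, y, z) position.
    A lower value means the recreated mix is closer to the reference.
    """
    return sum(
        math.dist(participant_positions[name], reference_positions[name])
        for name in reference_positions
    )
```

With this definition a perfect recreation scores 0, and a single source placed one unit too far away contributes exactly 1 to the total.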

The participants were clearly instructed that their only task was to recreate the
given audio mix in each version and that they had an unlimited amount of time
until they felt satisfied with the mix. In both cases, clear instructions and a poster
showing the different controls were offered. The participants had time to familiarise
themselves with each environment before they proceeded to the mixing task. The Oculus
Quest VR headset and a MacBook Pro were used for the VR and computer screen
versions, respectively. The audio was routed from a separate computer through a pair
of wireless AIAIAI TMA-2 headphones. The mixing task sessions lasted between
25 and 45 min for each participant, depending on the amount of time used on each mix.


9.12 Participants 


Before the participants of the mixing task evaluation started, they were asked to
answer a few questions regarding their musical/production/mixing background.
These questions were used to ensure that the participants, and therefore the data
gathered, matched the pre-defined target group: hobby producers, semi-professional
producers, professional mixing engineers and composers (Sect. 9.5).

It could be seen that the majority of the participants had a lot of mixing experience,
where 54.2% had 3+ years of experience, and of them, 41.7% had 5+ years of
experience. The rest, 45.8%, had 1–2 years of experience in mixing music. The same
percentages were seen regarding their experience in mixing spatial audio (binaural,
surround, ambisonic): the majority (54.2%) did not have any experience, whereas
45.8% had experience. Lastly, a question regarding whether or not the participants
had tried VR before was asked, where the majority had tried it before (83.3%).


9.13 Results 


This section will focus on presenting the results from the evaluation. The section 
will be divided into three different sub-sections: Focus group interview (conducted 
with students from Rhythmic Conservatory of Copenhagen) and the mixing task 
evaluation including a post-evaluation survey. 


9.13.1 Focus Group Interview 


As mentioned, the first part of the evaluation consisted of a focus group interview.
Below, quotes, opinions and summaries from the focus group interview are outlined,
divided into themes and main topics. See Appendix C for the full
interview transcription.


Efficiency 


• ‘I consider time. If it takes more or less time to do in VR. I don’t know’.
• ‘I could imagine that you would get done faster with some things. It seems very
effective’.


The participants were asked about their initial thoughts regarding VR as a mixing
tool and were all concerned about how efficient it potentially could be. Some participants
felt that VR might be used as a quick sketching tool and compared it to a big brush
painting on a canvas. Beyond the concept of this project, it could be used as a more
creative tool, rather than something one would use for precision.


Environment 


• ‘It could potentially set an atmosphere’.

• ‘... when I mix it is definitely something visual happening in front of me, I see
the elements in front of me. It is not necessarily that I see the orchestra in front 
of me—it is much more abstract. A sprinkle over here, the sub-frequencies being 
another shape’. 


It was stated that the virtual room could set a mood for the production by having 
different abstract elements. One participant pointed out that if the room should set a 
mood, it should be in a visually abstract way and not by looking realistic as this was 
the way he mixed mentally. It was furthermore stated that the decision of keeping 
the environment relatively neutral made sense in order not to influence the mix in 
an undesired direction and that the visuals used were pleasant and made sense in a 
mixing situation. 


Spatial Sound Algorithm 


• ‘I found it slightly under-dimensioned so when you panned things to the side it
was not as much as you would imagine. Front and back made sense well. When I
panned something this much to the right I would also expect to hear it more to the
right’.

One participant described the panning as being ‘under-dimensioned’, but in general
the participants found the spatial algorithm satisfying. A few participants noticed the
exclusion of elevation, whereas most participants felt the match between sound and
source movement realistic.


Features 


• ‘It could also be used to do automation in a mix. ... You would have a much bigger
area to draw on. I think that would be extremely useful’.

• ‘... they (objects) could have different shapes depending on if it is a string
instrument or a wind instrument, or drums, or vocals’.

• ‘Shouldn’t the other button be mute?’

• ‘I think it is necessary to know that there is activity on the track’.

• ‘Put a number on the dB, like a gain volume’.

Multiple participants described how they could imagine using this tool to create
automation. Having different visual representations of the different sound sources
was also supported by several participants, as was having a visual representation
of sound activity on each track, like a VU meter on a mixer.


Precision 


• ‘In a DAW [it would be most precise]’.
• ‘But also hard to say, when we only tried this a bit’.


There was a general consensus that a DAW was expected to be more precise than the 
implemented VR program. It was pointed out by one of the participants that it is hard 
to say when they have not spent more time using the VR program, but in general, the 
program, together with the binaural algorithm was experienced as something quick 
rather than precise. All participants agreed that the program gave enough information 
to make a judicious mix, but the concept made it hard to fiddle and go into small detail. 


Concept 


• ‘... I still think it suddenly is more about my experience mixing it instead of making
an experience for others. And those two things can of course play well together,
but I think I can install myself in a studio environment which helps me to get a
good experience without a VR-headset’.

• ‘... it feels a bit more like you are going to play a game. In some parts of the process
it might be a positive thing, I mean also more for composition’.

• ‘I cannot accept that I have not decided what it is this movement does. ... I do not
have any emotional connection to this’.

• ‘I think as the program is right now it might work even better for people who
do not have experience making music and have to learn to visualise music in an
extremely intuitive way’.


A participant pointed out that he found the prototype to be useless for him since 
it was designed for spatial mixing and not directly suitable for exporting a stereo 
mix. He pointed out that the prototype seemed to be designed for the producer to 
have a good experience instead of the final listener to have a good experience. It was 
mentioned by one participant that the application would be more relevant to use if it 
at least included the functions of a large format console channel strip, for each audio 
track. 


Comfort in VR 


• ‘I felt a bit sick. When I took off the glasses I felt really dizzy, but I think it is
something you maybe have to get used to’.
• ‘(In the environment I could spend) 5–10 min or something like that’


One participant explained how he felt dizzy after using the prototype, while another 
participant imagined that he could not spend more than 5–10 min in VR. It was
furthermore discussed by several participants whether switching between headset and 
screen was better than staying in VR. Both ideas were supported by other participants. 




Table 9.1 Mean and standard deviation of precision and time across the two experimental conditions

[Columns: Precision (Unity units) and Time (s), each for Screen and VR. Rows: Mean and
Std. dev. Of the cell values, only 248.17 survives.]


9.13.2 Mixing Task 


Besides the focus group interview, the final evaluation consisted of a mixing task in 
two different versions. The means, standard deviations as well as histograms of the 
collected data for both time (in seconds) and sums of relative precision compared to 
the reference mix (in Unity units), are shown (Table 9.1). 

The level of measurement of this evaluation is ratio data and since the study 
is using a within-group design, homogeneity of variance can be assumed without 
testing this. To test if the data is normally distributed an Anderson-Darling test is 
used on each data set for both precision and time. 

For precision data, the results of the Anderson-Darling tests confirmed that the null 
hypothesis ‘the data are normally distributed’ should not be rejected, outputting p- 
values for screen version and VR version, respectively, p = 0.4381 and p = 0.0693. 
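The normality check described above can be sketched in a few lines of Python. This is only an illustration: the chapter does not state which software was used, the function name `anderson_darling_normal` is hypothetical, and the p-value relies on the D'Agostino and Stephens approximation for the case where mean and variance are estimated from the sample. Both data sets below are synthetic and are not the study's measurements.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(sample):
    """Return (adjusted A^2, approximate p) for H0: 'the data are normally distributed'."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    z = sorted((x - m) / s for x in sample)
    std = NormalDist()
    total = 0.0
    for i in range(n):
        # (2i - 1) * [ln F(z_i) + ln(1 - F(z_(n+1-i)))] with 1-based i;
        # cdf(-z) is used for the upper tail to avoid computing 1 - cdf(z).
        total += (2 * i + 1) * (math.log(std.cdf(z[i])) + math.log(std.cdf(-z[n - 1 - i])))
    a2 = -n - total / n
    a2 *= 1 + 0.75 / n + 2.25 / n ** 2  # small-sample adjustment
    if a2 >= 10:
        p = 0.0
    elif a2 >= 0.6:
        p = math.exp(1.2937 - 5.709 * a2 + 0.0186 * a2 ** 2)
    elif a2 >= 0.34:
        p = math.exp(0.9177 - 4.279 * a2 - 1.38 * a2 ** 2)
    elif a2 >= 0.2:
        p = 1 - math.exp(-8.318 + 42.796 * a2 - 59.938 * a2 ** 2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * a2 - 223.73 * a2 ** 2)
    return a2, p


# Synthetic samples of 24 values each (assumed sample size, illustrative only).
quantiles = [NormalDist().inv_cdf((k + 0.5) / 24) for k in range(24)]
bell_shaped = [35.0 + 10.0 * q for q in quantiles]            # symmetric, bell-shaped
right_skewed = [300.0 * math.exp(2.0 * q) for q in quantiles]  # a few very long times

_, p_bell = anderson_darling_normal(bell_shaped)
_, p_skew = anderson_darling_normal(right_skewed)
print(f"bell-shaped: p = {p_bell:.4f}, right-skewed: p = {p_skew:.2e}")
```

A p-value above 0.05 means the normality hypothesis is retained, as for the precision data here; a p-value below 0.05 means it is rejected.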

For time data, the results of the Anderson-Darling tests showed that the null
hypothesis ‘the data are normally distributed’ should be rejected, outputting p-values
for screen version and VR version, respectively, of p = 0.0005 and p = 0.0422, meaning
that the data are not normally distributed. The differences between the samples were
also seen in the standard error, which is a measure of how well the sample reflects
the population, in this case 66.3 and 50.5, respectively. Additionally, the correlation
coefficients describing the relationship between accuracy and time spent with the
mixing task indicate no significant correlation, neither in VR (r = −0.3) nor in the
PC version (r = −0.2).
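The two descriptive quantities used in this paragraph can be sketched as follows. The standard error is taken here as the standard deviation divided by the square root of the sample size, and the correlation as Pearson's r; both are assumptions, since the chapter does not name the exact formulas. The inputs 325.21 s, 248.17 s and n = 24 are the time standard deviations and the number of mixing-task participants reported elsewhere in this chapter, while the helper names and the short example lists are hypothetical.

```python
import math
from statistics import mean


def standard_error(std_dev, n):
    """Standard error of the mean: s / sqrt(n)."""
    return std_dev / math.sqrt(n)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Reported time standard deviations, with n = 24 participants assumed per condition.
se_first = standard_error(325.21, 24)
se_second = standard_error(248.17, 24)
print(round(se_first, 2), round(se_second, 2))

# Hypothetical per-participant values, only to show the call.
times_s = [310.0, 420.0, 505.0, 640.0, 980.0]
errors = [38.0, 30.0, 41.0, 29.0, 35.0]
print(round(pearson_r(times_s, errors), 2))
```

Under these assumptions the computed standard errors land close to the reported 66.3 and 50.5; the small remaining gap may come from rounding or from a slightly different computation in the original analysis.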

Since only the data from the precision test is normally distributed and thereby
fulfils all assumptions for parametric tests, t-tests will be used only on the precision
data. The t-test does not reject the null hypothesis ‘The mixes made by the participants
in the VR version have no difference or less relative precision to the reference mix,
compared to the computer version’ at the 95% confidence level. The high p-value
(p = 0.9531) likewise shows that no significant difference between the means is
found.
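As a rough sketch of the kind of test used here, the following code runs a paired t-test on per-participant precision scores. Pairing is an assumption based on the within-group design; the chapter does not say whether a paired or independent test was used, nor whether the reported p-values are one- or two-sided, so both are returned. The function names and the six example scores are hypothetical, and the t distribution is integrated numerically so that only the standard library is needed.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def paired_t_test(vr, screen):
    """Return (t, p_one_sided, p_two_sided) for paired error scores.

    The one-sided alternative is 'VR errors are smaller than screen errors'.
    """
    diffs = [a - b for a, b in zip(vr, screen)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    p_less = t_cdf(t, n - 1)
    p_two = 2 * min(p_less, 1 - p_less)
    return t, p_less, p_two


# Hypothetical distances to the reference mix for six participants.
vr_err = [8.0, 9.0, 10.0, 9.0, 8.0, 10.0]
screen_err = [10.0, 12.0, 11.0, 13.0, 9.0, 14.0]
t_stat, p_less, p_two = paired_t_test(vr_err, screen_err)
print(f"t = {t_stat:.2f}, one-sided p = {p_less:.4f}, two-sided p = {p_two:.4f}")
```

With a one-sided p below 0.05 the null hypothesis ‘no difference or less precision in VR’ would be rejected; with a large p, as for the full data set in the study, it is retained.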

It was mentioned by multiple test participants that they had a considerably harder
time spatially positioning the ‘choir’ track (a track in the pre-made mix) in the VR
version, compared to the other tracks (discussed in Sect. 9.14). If the position of
the ‘choir’ track is left out of the data sets, the t-test rejects the null hypothesis
‘The mixes made by the participants in the VR version have no difference or less
relative precision to the reference mix, compared to the computer version’ at the
95% confidence level. It furthermore has a low p-value (p = 0.0015), meaning that
a significant difference between means is found and that there is only a small
possibility that the difference between the groups happened by chance alone.
Moreover, the effect size of the t-test result was calculated to be 0.16, which is a
small effect.


Fig. 9.6 Results from the two different categories of questions (blue = computer, orange = VR).
Left: Questions for the controls. Right: Questions for the perception


Table 9.2 The means of the answers for each category and medium

            Controls    Perception
Overall     3.917       3.917
VR          4.708       4.667
Computer    3.125       3.167

As mentioned at the start of this section, a post-evaluation survey was carried 
out where the participants were asked to answer questions related to the two eval- 
uated platforms. The results can be seen in Fig. 9.6, where all items addressing the 
same aspect have been added together. This was possible as the categories showed 
a Cronbach’s alpha value of 0.81, meaning there is a good internal consistency, and 
thus reliability, within the answers [8]. Additionally, the mean of the answers, both 
overall for each category and separately for each medium is shown in Table 9.2. 
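For readers unfamiliar with the reliability measure, Cronbach's alpha can be sketched as below, using the usual formula α = k/(k − 1) · (1 − Σ item variances / variance of the summed scores). The function name and the small response matrix are hypothetical; the study's actual questionnaire answers are not reproduced here.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of participants, each a list of k item scores."""
    k = len(responses[0])
    item_columns = list(zip(*responses))
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical 5-point Likert answers: five participants, three items on one aspect.
answers = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]
alpha = cronbach_alpha(answers)
print(round(alpha, 3))
```

Values around 0.8 or higher, such as the 0.81 reported above, are conventionally read as good internal consistency, which is what justifies adding the items of one category together.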


9.14 General Discussion 


The following section will discuss the results outlined in the results section above. It
will examine the outcome of the t-tests and debate the opinions of the focus group
interview.


Precision and Effect of ‘Choir’ Track


As mentioned in Sect. 9.13.2, the results from the evaluation fail to reject the null
hypothesis that the VR version has no difference or less precision compared to the
computer version. Even though the mean of accuracy is 0.14 units lower in the VR
version, the difference is so small that no conclusion can be drawn. However, one
audio track in particular seemed to cause problems in the VR version: the
‘choir’ track. The mean of the accuracy of the ‘choir’ track in the computer version
was 1.86 while being 9.95 in the VR version, a difference of 8.09, even though the
audio track was identical. There are a few possible explanations for why that may be
the case.

In the computer version, the position of the ‘choir’ track in the reference mix was 
very close to the initial position of the visible ‘choir’ track, while in the VR version 
the ‘choir’ track was positioned behind the user and with greater distance. However, 
that can be said about some of the other audio tracks as well. Another explanation 
could be that the ‘choir’ track, even though being converted to mono, had different 
amplitude changes for each voice and the recording included a natural reverb. This 
could have made it difficult to perceive at what distance as well as angle the source 
was placed in the scene. Some participants mentioned this, saying they felt the choir 
track was in stereo which hindered them from correctly placing the audio track. 
They simultaneously added that it was hard to navigate as the track faded in and out, 
making it difficult to know whether or not it was playing. By removing the choir 
track from both scenes the mean accuracy of the computer version goes from 35.74 
to 33.88 while the VR average goes from 35.60 to 25.63. Furthermore, the removal
of the choir from both versions results in the rejection of the null hypothesis at
the 95% confidence level, as mentioned in Sect. 9.13.2. With the choir track excluded,
the VR version can therefore be regarded as significantly more precise than the
computer version.
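The reported means are consistent with the overall precision score being a sum of per-track distances to the reference mix: 35.74 − 1.86 = 33.88 for the computer version, and 35.60 − 9.95 = 25.65 for VR, close to the reported 25.63. A minimal sketch of such a score is given below, assuming Euclidean distance in scene units; the function name, track names and positions are hypothetical and the exact metric used in the study is not specified.

```python
import math


def precision_score(placed, reference, exclude=()):
    """Sum of Euclidean distances between placed and reference source positions.

    Lower is more precise. Tracks listed in `exclude` are left out of the sum.
    """
    return sum(
        math.dist(placed[track], reference[track])
        for track in reference
        if track not in exclude
    )


# Hypothetical (x, y, z) positions in scene units.
reference_mix = {"drums": (0.0, 0.0, 0.0), "choir": (0.0, 0.0, 0.0)}
participant_mix = {"drums": (3.0, 4.0, 0.0), "choir": (0.0, 6.0, 8.0)}

with_choir = precision_score(participant_mix, reference_mix)
without_choir = precision_score(participant_mix, reference_mix, exclude={"choir"})
print(with_choir, without_choir)
```

Computing the score with and without one track, as here, is what allows the effect of the ‘choir’ track to be separated from the rest of the mix.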


Efficiency 


Another evaluated topic was whether the VR version was more efficient than the 
computer version. As mentioned in Sect. 9.13.2, the average time spent in VR was
just under 2 min less than in the computer version. Furthermore, in the computer
version two outliers both spent just under 25 min. The time these outliers spent in
the computer version did not, however, result in more accuracy, as both were above
the mean, with 44.92 and 37.80, respectively. Even though it seemed that the
participants on average were quicker in the VR version, this could not be
shown statistically, as the data did not meet all of the parametric assumptions.

Looking at Fig. 9.6, its content might support the indication that VR, in this 
case, can be considered a quicker tool. Here it can be seen that the VR version 
scores significantly higher in regards to the controls and thus seemed more intuitive. 
Specifically, two questions stand out. Firstly, the question ‘I felt comfortable using 
the controls in the computer version’ resulted in a completely split opinion as can be 
seen in Fig. 9.7.


Fig. 9.7 Top: Question from post questionnaire regarding the comfortableness of the
controllers in the computer version. Bottom: Question regarding the ease of placing
sound sources in the computer version


Fig. 9.8 Top: Question regarding comfortableness of the VR controllers. Bottom: Question
regarding the ease of placing sound sources in VR


Secondly, when asked ‘In the computer version, the controls made it easy to place
the visual sound sources (spheres) where I wanted them’ the majority were either
neutral or felt the controls made it difficult, as shown in Fig. 9.7.

Since the participants struggled with placing the sound sources at the desired
position and also found the controls uncomfortable, they might have taken longer
to replicate the reference mix in the computer version. Therefore, it cannot be
concluded that VR is more efficient; however, the concept of the VR version hints
towards a more instinctive way of working. The same questions for the VR version
resulted in slightly agree or agree, as shown in Fig. 9.8.

Another important aspect to consider for the time data is why it did not meet
the parametric assumption of normality. The time spent by the participants in the
different versions simply was not symmetrically distributed around the
mean in each version. In both scenarios, outliers were observed and especially
the VR version included widely spread data. These conclusions are supported by
the standard deviation of each data set being 325.21 and 248.17 s respectively. The
mean in each data set is thus not a good representative measure of the participants
and, looking at the standard error being 66.3 and 50.5 s for each distribution, it
is additionally seen that the sample mean does not represent the population
particularly well. Therefore, there could have been a missing correlation in the time
spent by the participants and one might discuss whether the participants were enough
alike to be considered a unified sample group/population. The experience in VR does
not show any warning signs as almost everyone (83.3%) had tried it before. The question
then lies in whether or not mixing experience, as well as experience with and without
spatial audio, could have affected the efficiency of the participants, creating a gap
between their results. However, nothing supports this, as 3 of the 5 outliers of time
spent in VR had 5+ years of experience mixing music, which also was seen in 1 out
of the 2 outliers in the computer version. Therefore, the whole test setup and the way
that time is measured have to be reconsidered.


Focus Group Interview 


Looking at the qualitative data from the focus group interview two main findings 
are clear: 1. The participants saw the product as a quick sketching tool to test ideas 
and outline mixes, rather than a tool to control precision and finer spatial details 
within the sound, and 2. The participants were overall positive about the interaction 
with the product and its visual appearance, sensory benefits and intuitive controls, 
which were generally well-received. 

In relation to the first finding, several participants stated, just like the expert in
the target group section, that time and intuitiveness, in general, were important
aspects for them in a mixing tool/device, and that the program indeed seemed to
facilitate this. A participant expressed, amongst other things, that he ‘could imagine
that you would get done faster with some things’, and that the program seemed ‘very
effective’. Furthermore, participants agreed that the VR program definitely could be
used as a quick sketching tool for swift ideas and testing of audio placement in a 
given space. This could, amongst other things, have been a result of the intuitive 
way of placing sound sources as well as the quick dynamic sound feedback and the 
possibility to link it up with Ableton Live. However, it could also have been due 
to the simplicity of the environment and the fact that only fundamental controls, 
pre-made audio effects and interaction possibilities were included, giving it a ‘to the 
bone’ concept. Besides allowing positioning of sound sources (panning and volume), 
the program simply did not offer state-of-the-art possibilities such as the potential to 
manipulate the sound in finer detail, thus forcing the participant to use more time in the 
environment. This was moreover seen in the ‘features’ discussion of the interview, 
where participants emphasised a need for interactive dB scales, mute buttons and 
the possibility to ‘do automation in a mix’. It is, therefore, clear that the opinion
of the participants in the focus group interview is somewhat contradictory to the
actual findings of the mixing task, which failed to show a significant time difference
between the two conditions/programs. To understand this missing relationship, the
post-evaluation survey as discussed above has to be taken into consideration. Besides
the fact that the participants were overall satisfied with the interaction, it was clear in
the opinion on the mixing task that both the controls and the sensory feedback of the
VR version made the experience easier, as well as more comfortable, compared to the
experience of spatially mixing audio on a computer screen. An aspect that, therefore,
might have influenced the focus group participants’ opinion on efficiency could have
been the general user experience, even though no statistical evidence showed that
the participants were quicker in the VR program compared to the computer version.
Using VR simply seemed like an intuitive and instinctive way of placing sound
sources spatially, as it utilises both visual and auditory perceptual cues known
from everyday life.

Contrary to the time data, there actually was a significant difference in the precision 
of the mixes in the two environments, which as mentioned to some extent conflicts 
with the thoughts of the focus group. It is important to note that the difference is 
statistically evident without the ‘Choir’ track, but there is no doubt that it does not 
match the expectations of the focus group, who did not experience the program as 
something particularly precise. When compared to the framework of a DAW, there
was a general consensus that the environment made it hard to be exact, and a few
participants even saw this as a result of the under-dimensioned mappings distance-
and angle-wise. This was, as an example, expressed in the quote ‘I found it slightly
under-dimensioned so when you panned things to the side it was not as much as you 
would imagine’. This may be the result of having 5° interval between IRs on the 
azimuth but could also just be the effect of the binaural HRIRs as they represent how 
sounds are naturally perceived instead of allowing for absolute panning. 
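To get a feel for how much a 5° grid of impulse responses could contribute to this impression, the sketch below snaps an arbitrary azimuth to the closest measured direction. Nearest-neighbour selection is an assumption; the chapter does not say whether the implementation picked the nearest IR or interpolated between neighbours, and the function names are hypothetical.

```python
def nearest_measured_azimuth(azimuth_deg, step_deg=5):
    """Snap an azimuth (degrees) to the closest direction on a regular grid."""
    return (round((azimuth_deg % 360) / step_deg) * step_deg) % 360


def quantisation_error(azimuth_deg, step_deg=5):
    """Absolute angular difference between the requested and the snapped azimuth."""
    snapped = nearest_measured_azimuth(azimuth_deg, step_deg)
    diff = (azimuth_deg - snapped + 180) % 360 - 180
    return abs(diff)


for requested in (92.4, 93, 358, -3):
    print(requested, nearest_measured_azimuth(requested), quantisation_error(requested))
```

Under this assumption the angular error introduced by the grid never exceeds half the step, i.e. 2.5° for a 5° interval.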

The focus group had different opinions on the visual appearance of the environment,
though the majority agreed that its neutrality was beneficial. Although
some participants suggested having the room aesthetics change relative to the
mood of the music being mixed, the visualisation of the audio in the process of
mixing was experienced to be subjective. Therefore, having pre-defined visuals and
scenery might be more disadvantageous than beneficial. However, a few participants
mentioned that if the appearance should change, it should use unrealistic, abstract
visuals, since visuals that were too realistic would be distracting.

As mentioned above, there were a few features lacking in the program. Firstly,
colour coding of tracks according to instruments, as well as different shapes
depending on the instrument, was missing. Secondly, participants said that having
feedback or audio-reactive shaders on the spheres representing the tracks would help
the user to understand when a track has sound on it. Suggestions for this were either
a VU-metre for each audio track (as mentioned above) or having the spheres change shape
relative to the sounds they represent, using frequency or timbre to manipulate vertices.
Additionally, the focus group agreed that adding more tracks in the VR environment 
would introduce clutter problems. As they only worked with five tracks, having 
more potentially could eliminate the benefits of VR compared to PC. Related to this, 
amongst other things it was stated: ‘When we tried it here it was very manageable 
with five tracks, but if you have 67 tracks as we talked about, it might hinder you 
more than it helps’. A suggestion for this was being able to group tracks, ‘Things 
could also be grouped visually, then you could press “show all” or see this group or 
this part [like in layers]. In that way, you could visually mute something’—as found 
in digital mixing consoles mentioned in the analysis. 

In relation to the concept, some participants struggled to grasp the core idea behind 
the product, spatial mixing. The fact that the mix changed relative to head movement 
confused many participants and hindered them in understanding the possibilities and 
functionality of it. As one participant said, ‘I often ended up looking one way and
then imagining that I mixed in stereo [...] this just made me feel that everything was
in-precise’. Others mentioned that, since the mix could not be exported in a way that
lets the listener perceive it as the mixer did, the product would be of no use:
‘...also the thing where
the sound picture changes when you move. It is fun by itself but if it has to work in
real life you have a dimension included that you do not have in the end’. The fact
that they, in some way, did not seem to understand the thought process behind the
product could have resulted in them preferring to work in DAWs with traditional
controls, which they were more comfortable with, and therefore believing VR to be
less accurate.

While it is apparent that recreating a mix in the VR version in some instances and 
for some instruments, was more precise than doing it in the computer version, the 
results of the evaluation show no statistical evidence that it was more efficient for 
the participants to work in the VR version compared to the computer version. These 
findings contradict the opinions of the focus group interview of experts within the 
field of music production, who experienced the VR program as a quick sketching tool 
for ideas and a spatial overview. The following discussion will debate these results, 
and look further into the design of the test and the fulfilment of the requirements 
made for this product and project. 

Starting off by looking at the time data and the fact that the VR version was not 
more efficient than the computer version, it is important to consider the test setup 
and the way the researchers used time as a measurable variable. In the different test 
conditions, time was used as a dependent variable influenced by the independent 
variable being the two versions of the program. However, time simultaneously was
of secondary importance, as the participants were told to recreate the mix to their
satisfaction (until they thought they were precise enough). The amount of time used,
and thus efficiency, was therefore not an explicit part of the mixing task, and its
validity as a measure is therefore debatable. Time was omitted from the task
introduction as it was seen as a potential confounding variable of the precision
measure, which could have pushed the participants to slack on the mix recreation in
order to get a quick time. Thus, it was hoped that time would represent the natural
efficiency; however, the researchers did not reflect on the potential bias that could
lie within this while designing the test. Participants could, as an example, have had
different attitudes towards time: maybe they were busy or, conversely, immersed and
therefore used more time. Thus, in order for time
to have been a valid measure, it might be argued that two tasks for each version 
should have been carried out: one with precision as seen in this evaluation, and one 
with time where the participants were asked to recreate the mix to their satisfaction 
as quickly as possible. This would have given efficiency a more prominent role and 
possibly made it a valid and streamlined measure. 

Another aspect that could have been changed about the test setup was the methodology
used. It initially was decided to use a one-way Repeated Measures ANOVA
with three conditions, allowing the researchers to exclude the role of the controls in
the test. Whereas the evaluation now only has the possibility to give indications about
the role of the controls, the ANOVA test could have completely ruled this out, since a
middle condition combining the two versions would have been used as an extra guarantee.
Conclusions cannot, in fact, be drawn from the Likert scale in this post-evaluation; only
indications can be drawn, and therefore the scale’s relevance to the test as well as the
connection to the opinions of the target group interview should be considered carefully.

The opinions of the focus group and whether or not these, in general, are represen- 
tative, is additionally an aspect that should be discussed. It was, as a starting point, 
decided to only use participants in each evaluation process that were experienced 
with mixing sound in one way or another. It was seen that both participants in the 
focus group interview (counting music production students), as well as people in 
the mixing task (including ‘Sound and Music Computing’ students, tonmeister and 
sound engineer students as well as professional and hobby producers), fulfilled these 
requirements. Even though the two sample groups might have had different visions, 
they represent the target group and it is thus assumed that results and opinions can 
be compared. However, the target group definition might in the first place have been 
wrong. As mentioned by the focus group participants, who only saw the product as
an easy and quick sketching tool, its intuitiveness could be beneficial for beginners,
who might not care about fine detail or the lack of stereo features. It could thus have
been interesting to see the feedback and mixing task results of music production
newcomers with less experience, to see if the sensory benefits and intuitive
interaction of VR would make even more sense in a beginner situation.

An additional aspect mentioned by the focus group was the difficulty of grasping the idea
of mixing binaural audio. Multiple participants discussed how the fact that the audio
mix changed relative to the head movement made it difficult to understand how the
final mix would sound. In particular, if the mix was exported, they
could not ensure that the end-user would hear it in the same way the mixing engineer
intended. This was largely due to the fact that many of the focus group participants
were locked on the idea of stereo mixing. The lack of a clear ‘centre’ position was
new to them, as they are used to mixing in a fixed position in front of two speakers.
With that said, the focus group agreed with the fact that the product enabled the user 
to very quickly and efficiently place sound at its correct position in the 3D space, 
possibly more quickly than with a computer. A possible application of the product 
was, as mentioned by one participant, placing audio sources at a correct position 
when working with movie sound. However, the lack of tools available, such as EQ, 
the possibility of choosing their own reverb, etc. would hinder them with mixing and, 
therefore, the product would be more suitable as an audio placement tool rather than 
mixing tool. Therefore, it is worth considering whether mixing music binaurally is 
suitable for current platforms and perhaps the focus of the test should have been on 
placing audio sources in a ‘correct’ position relative to the visuals when investigating 
accuracy. Additionally, an aspect that was not taken into consideration during the 
implementation of this project was the FoV. It is plausible that having a different FoV, 
be it bigger or smaller, could have affected the impression of the IRs. In other words, 
had the FoV been bigger, angles such as 90° or 270° may have mismatched the audio 
and visuals. The same can be said with a smaller FoV. This may have been a reason 
a member of the focus group felt the spatial algorithm to be under-dimensioned, as
he stated ‘When you panned things to the side it was not as much as you would
imagine [...] When I panned something this much to the right I would also expect to
hear it more to the right’. Even though no other comments were made regarding the
‘spatialness’ of the sound, this is worth keeping in mind and, looking back, it could
have been beneficial to run initial tests investigating the right relationship between
the FoV and the sound.

With regard to the final evaluation, some technical problems were experienced.
The main problem occurred when the user soloed an audio track. Almost every
participant (some more than once during the test) encountered a problem where the
audio track’s shader indicated that the track was soloed when in fact it was not.
It is believed that this was caused by interference in the OSC (UDP) transmission 
between the computer running Unity and the one running Ableton Live. The problem 
took place when a track was soloed in Unity, which triggered the corresponding track 
to be soloed in Ableton Live. When the track was then un-soloed in Unity which 
updated the shader, the track stayed soloed in Ableton which caused confusion for the 
participant. This problem was something that was experienced during initial testing 
of the program and a specific trigger was created in order for the evaluator to quickly 
change/repair the state of the audio track in Ableton Live. This problem was almost
non-existent in the VR version and could, therefore, have affected
the time measure. Furthermore, no initial test of the computer controls and interaction
was conducted before the final evaluation, whereas the VR version and controls had been
tested with the focus
group. The controls for the computer were thus purely decided by the project group,
which may have resulted in non-intuitive controls and interaction, as was backed
up by the post-evaluation survey. An initial evaluation or pilot test of the computer
controls should have been conducted to ensure a fairer comparison.
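The solo problem described above is the kind of state mismatch that can arise when a single datagram is lost on a transport without delivery guarantees, such as UDP. The simulation below is a hedged illustration only: it assumes, without confirmation from the chapter, that solo changes were sent as one-off messages, and it compares relative ‘toggle’ messages with absolute ‘set state’ messages that are re-sent. No real OSC library is used; the function `simulate` and its arguments are hypothetical.

```python
def simulate(sender_states, delivered, absolute):
    """Replay solo changes and return the receiver's final solo state.

    sender_states: the solo state the sender holds after each message it emits.
    delivered:     whether each datagram actually arrives.
    absolute:      True  -> each message carries the full state ('solo = X');
                   False -> each message only means 'toggle solo'.
    """
    receiver_solo = False
    for state, arrived in zip(sender_states, delivered):
        if not arrived:
            continue  # the datagram is lost and nobody is notified
        receiver_solo = state if absolute else (not receiver_solo)
    return receiver_solo


# Solo on, then solo off, with the second datagram lost: the receiver stays soloed.
stuck = simulate([True, False], [True, False], absolute=False)

# Absolute messages with one re-send of the latest state recover from the loss ...
recovered = simulate([True, False, False], [True, False, True], absolute=True)

# ... and a re-send is harmless when nothing was lost, unlike a repeated toggle.
resend_absolute = simulate([True, False, False], [True, True, True], absolute=True)
resend_toggle = simulate([True, False, False], [True, True, True], absolute=False)

print(stuck, recovered, resend_absolute, resend_toggle)
```

Sending the full state makes each message idempotent, so it can be repeated or acknowledged and retried without side effects; whether this matches the cause of the problem in the study remains an open assumption.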

In relation to the technical aspects, it also is worth discussing whether or not 
the HMD used for this project was the correct choice. The Oculus Quest was the 
chosen HMD due to it being wireless and thus consumer-relevant, providing the 
highest screen resolution, as well as having a satisfactory refresh rate. However, 
since it has the hardware built into the headset, the computational power is limited. 
Therefore, the whole Unity project had to be specifically optimised for the Quest 
and the complexity of the scene reduced. This may have come at a price of limited 
features, reactive shaders and objects. Having VU metres for each audio track or 
having the shape of the spheres change relative to the sounds they represent, may 
have made it easier for the user to distinguish which tracks were active. More or 
improved lighting may as well have improved the aesthetics of the environment as 
a whole, which could have been possible with another HMD such as the HTC Vive 
that solely relies on the power of a computer running the program. Using a different
HMD could thus have improved the VR program and its overall appeal, and the features
desired by the focus group participants might have been fulfilled. However,
wireless capabilities, ease of use and accessibility would, in this case, have been lost.


9.15 Conclusion


Based on an investigation of VR and spatial audio, it has been concluded that VR
has both the sensory and interactive benefits to potentially enhance the experience
of visually positioning sound sources in space. Its intuitive controls, sense of depth
and user-including advantages have been shown in research to draw on instinctive
behaviours and, from that standpoint, this project examined whether or not the
aforementioned qualities could improve the process of spatial audio mixing, which
nowadays mostly is carried out using 2D plugins on a computer screen. On this basis,
a design framework for VR covering interaction, binaural audio, perceptual cues, and graphical
principles was built and an application was implemented to allow a user to visually 
mix real-time audio, retrieved from the DAW Ableton Live, using dynamic binaural 
synthesis. Each component and control of both the visual and the auditory system was 
carefully chosen based on requirements targeting both the benefits and necessities 
of pairing visuals and sound and in order to answer the final problem statement of 
the project, the VE was evaluated against its ‘computer screen’ counterpart. 

Defined as measurable improvements by the project’s researchers, the two main
aspects ‘efficiency’ and ‘precision’, were evaluated in two different conditions: one 
evaluation using a focus group interview with experts examining the opinions, per- 
ceptions and feelings about the VE, and one evaluation consisting of a mixing task 
that quantitatively measured time and precision differences between the VR and 
the computer version. The evaluations were carried out over four days at Aalborg 
University Copenhagen, and 13 and 24 participants took part in each evaluation 
respectively. 

The results of the evaluations showed ambiguous tendencies. The focus group 
participants were in general positively minded towards the program and saw it as a 
quick sketching tool due to its intuitiveness, controls and apparent sensory feedback, 
rather than a tool for finer detail and precision manoeuvring. These opinions were, 
however, not possible to prove in the mixing task test. Firstly, the time
measurements did not meet the parametric assumption of normality and therefore
could not be tested through a t-test. Secondly, when comparing the means of the
precision scores (average distance from reference mix), the t-test showed that, when
removing the ‘Choir’ track (a track that caused problems for all participants), the
VR version was more precise than the computer version at the 95% confidence level. The
evaluations thus indicate that even though the VR version was not perceived as a precise
tool, its sensory benefits and interaction possibilities, whose qualities both sample
groups agreed on, are general improvements over the ones found on
a computer. However, all conclusions should be treated with care, as especially the
setup of the mixing task evaluation should have been reconsidered. A whole new
test should, as an example, have been carried out to make efficiency a valid measure,
and it simultaneously is important to note that the findings from the mixing task
evaluation regarding the VE can only be seen in the light of its ‘computer screen’
counterpart. 
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Part IV 
Sonic Experiences 


Chapter 10 
Audio in Multisensory Interactions: 
From Experiments to Experiences 


Stefania Serafin 


Abstract In the real and virtual world, we usually experience sounds in combi- 
nation with at least an additional modality, such as vision, touch or proprioception. 
Understanding how sound enhances, substitutes or modifies the way we perceive and 
interact with the world is an important element when designing interactive multi- 
modal experiences. In this chapter, we present an overview of sound in a multimodal 
context, ranging from basic experiments in multimodal perception to more advanced 
interactive experiences in virtual reality. 


10.1 Introduction 


This book examines the role of sound in virtual environments (VEs). However, most 
of our interactions with both the physical and virtual worlds occur through a combination 
of different sensory modalities. Auditory feedback is often the consequence 
of an action produced by touch and is presented in the form of a combination of 
auditory, haptic and visual feedback. Let us consider for example the simple action 
of walking: The auditory feedback is given by the sound produced by the shoes 
interacting with the floor, the visual feedback is the surrounding environment, and 
the haptic feedback is the feeling of the surface one is stepping on. It is important 
that these different sensory modalities are perceived in synchronization, in order to 
experience a coherent action. 

Since sound can be perceived from all directions, it is ideal for providing information 
when the eyes are otherwise occupied. This could be a situation where someone’s 
visual attention should be entirely devoted to a specific task, such as driving or 
performing surgery [46]. Another notable property of the human auditory 
system is its sensitivity to the temporal aspects of sound [3]. In many instances, 
response times for auditory stimuli are faster than those for visual stimuli [55]. Furthermore, 
given the higher temporal resolution of the auditory system compared to 
the visual system, people can resolve subtle temporal dynamics in sounds more readily 
than in visual stimuli; thus the rendering of data into sound may manifest periodic 
or other temporal information that is not easily perceivable in visualizations [17]. 
Moreover, the ears are capable of decomposing complex auditory scenes [3] and 
selectively attending to certain sources, as seen, for example, in the cocktail party 
problem [7]. Audition, then, may be the most appropriate modality for simple and 
intuitive (see [9, 37]) information display when data have complex patterns, express 
meaningful changes in time, or require immediate action. 

In this chapter, an overview is presented of how knowledge of human perception 
and cognition can be helpful in the design of multimodal systems where interactive 
sonic feedback plays an important role. Table 10.1 presents a typology of different 
kinds of cross-modal interactions, adapted from [2]. 

Sonic feedback can interact with visual or haptic feedback in different ways. 
As an example, cross-modal mapping represents the situation where one or more 
dimensions of a sound are mapped to visual or haptic feedback: A beeping sound 
combined with a flashing light. In cross-modal mapping, there is no specific interac- 
tion between the two modalities, but simply a function that connects some parameters 
of one modality to the parameters of another. 
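
As a minimal sketch of such a mapping function, the following Python fragment connects one parameter of a sound (its amplitude) to one parameter of a light (its brightness). The function name `beep_to_brightness`, the 0-255 brightness range, and the example envelope are illustrative assumptions and do not describe any system discussed in this chapter.

```python
def beep_to_brightness(amplitude, max_brightness=255):
    """Map a sound amplitude in [0, 1] to a light brightness level.

    Hypothetical cross-modal mapping: a plain linear function from one
    auditory parameter (amplitude) to one visual parameter (brightness).
    """
    clipped = min(max(amplitude, 0.0), 1.0)  # keep the input in range
    return round(clipped * max_brightness)


# A beep that alternates between on (1.0) and off (0.0)
beep_envelope = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]

# The light simply follows the beep: no interaction, only a mapping
light_levels = [beep_to_brightness(a) for a in beep_envelope]
```

The light carries no information of its own here; it only mirrors the sound, which is what distinguishes a mapping from the perceptual interactions discussed next.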


Table 10.1 Typology of different kinds of cross-modal interactions 


Cross-modal interaction | Description | Example 

Amodal mapping | Use of VEs or other representational system to map abstract or amodal information (e.g., time, amount, etc.) to some continuous or discrete sensory cue | The use of colour mapping and relative size in graphics and scientific visualization (e.g., colour, size, depth, etc.) 

Cross-modal mapping | Use of a VE to map one or more dimensions of a sensory stimulus to another sensory channel | An oscilloscope 

Intersensory biases | Stimuli from two or more sensory channels may represent discrepant/conflicting information | Ventriloquism effect [24] 

Cross-modal enhancement | Stimuli from one sensory channel enhance or alter the perceptual interpretation of stimulation from another sensory channel | Increased perceived visual fidelity of display as a result of increased auditory fidelity 

Cross-modal transfers or illusions | Stimulation in one sensory channel leads to the illusion of stimulation in another sensory channel | Synesthesia 

Intersensory biases become important where audition and a second modality provide 
conflicting cues. In the following section, several examples of intersensory 
biases will be provided. In most of these situations, the user tries to perceptually 
integrate the conflicting information. This conflict might lead to a bias towards a 
stronger modality. One classic example is the ventriloquism effect [24], which illus- 
trates the dominance of visual over auditory information when spatially discrepant 
audio and visual cues are experienced as co-localized at the location of the visual 
cue. 

The name derives from ventriloquists, who are able to give the impression 
that the speaking voice originates from the dummy they are holding 
rather than from the performer. This effect is commonly used in cinemas and 
home theatres where, although the sound physically originates at the speakers, it 
appears as if coming from the moving images on screen. The ventriloquism effect 
occurs because the visual estimates of location are typically more accurate than 
the auditory estimates of location, and therefore the overall perception of location 
is largely determined by vision. This phenomenon is also known as visual capture 
[64]. Another classic example is the Colavita effect [8]. In the original experiment, 
Colavita presented participants with an auditory (tone) or visual (light) stimulus, 
to which they were instructed to respond by pressing the tone key or light key 
respectively. The visual dominance effect refers to the finding that, when presented 
with bimodal stimuli, participants respond more often to the visual component. 

Vision is indeed the dominant sense in many circumstances. Visual 
dominance over hearing and other sensory modalities has been frequently demonstrated 
(e.g., [45]), and a neural basis has been posited for visual dominance in 
processing audiovisual objects (e.g., [48]). Cross-modal enhancement refers to stimuli 
from one sensory channel enhancing or altering the perceptual interpretation of 
stimuli from another sensory channel. As an example, three studies presented in [57] 
show how high-quality auditory displays coupled with high-quality visual displays 
increase the quality perception of the visual displays relative to the evaluation of 
the visual display alone. Moreover, the same study shows how low-quality auditory 
displays coupled with high-quality visual displays decrease the perception of quality 
of the auditory displays relative to the evaluation of the auditory display alone. These 
studies were performed by manipulating the pixel resolution of the visual display 
and Gaussian white noise level, and by manipulating the sampling frequency of the 
auditory display and Gaussian white noise level. These findings strongly suggest that 
the quality of realism in an audiovisual display must be a function of both auditory 
and visual display fidelities and their interactions. Cross-modal enhancements can 
occur even when extra-modal input does not provide information directly meaningful 
for the task. An early study by Stein asked subjects to rate the intensity of a beam 
of light. Their findings showed that the test subjects believed the light to be brighter 
when it was accompanied by a brief, broadband auditory stimulus than when it was 
presented alone. The auditory stimulus produced more enhancement for lower visual 
intensities, regardless of the relative location of the auditory cue source. 

Cross-modal transfers or illusions are situations where stimulation in one sensory 
channel leads to the illusion of stimulation in another sensory channel. An example of 
this is synesthesia, which in the audio-visual domain is expressed as the ability to see 
a colour while hearing a sound. When considering inter-sensory discrepancies, Welch 
and Warren propose a modality appropriateness hypothesis [64] that suggests that 
various sensory modalities are differentially well-suited to the perception of different 
events. Generally, it is supposed that vision is more appropriate for the perception of 
spatial location than audition, with touch sited somewhere in between. Audition is 
most appropriate for the perception of temporally structured events. Touch is more 
appropriate than audition for the perception of texture, whereas vision and touch may 
be about equally appropriate for it. The appropriateness is a 
consequence of the different temporal and spatial resolution of the auditory, haptic 
and visual systems. Moreover, especially when it is combined with touch stimulation, 
sound increases the sense of immersion [63]. 

Apart from the way that the different senses can interact, the auditory channel also 
presents some advantages when compared to other modalities. For example, humans 
have a complete sphere of receptivity around the head, while visual feedback has a 
limited spatial region in terms of field-of-view and field-of-regard. Because auditory 
information is primarily temporal, the temporal resolution of the auditory system is 
more precise. We can discriminate between a single click and a pair of clicks when 
the gap is only a few tens of microseconds [30]. Perception of temporal changes 
in the visual modality is much poorer, and the fastest visible flicker rate in normal 
conditions is about 40-50 Hz [4]. In multi-sensory interaction, therefore, audio tends 
to elicit the shortest response time [33]. 

In contrast, the maximum spatial resolution (contrast sensitivity) of the human 
eye is approximately 1/30 degrees, a much finer resolution than that of the auditory 
system, which is approximately 1 degree. Humans are sensitive to sounds arriving 
from anywhere within the environment whereas the visual field is limited to the 
frontal hemisphere, with good resolution limited specifically to the foveal region. 
Therefore, while the spatial resolution of the auditory modality is cruder, it can serve 
as a cue to events occurring outside the visual field-of-view. 

In the rest of this chapter, we provide an overview of the interaction between audi- 
tion and vision and between audition and touch, together with guidelines on how such 
knowledge can be used in the design of interactive sonic systems. By understanding 
how we naturally interact in a world where several sensorial stimuli are provided, we 
can apply this understanding to the design of sonic interactive systems. Research on 
multisensory perception and cognition can provide us with important guidelines on 
how to design virtual environments where interactive sound plays an important role. 
Through technical advancements such as mobile technologies and 3D interfaces, it 
has become possible to design systems that have similar natural multimodal prop- 
erties as the physical world. These future interfaces understand human multimodal 
communication and can actively anticipate and act in line with human capabilities 
and limitations. A large challenge for the near future is the development of such 
natural multimodal interfaces, something that requires the active participation of 
industry, technology, and the human sciences. 


10.2 Audio-Visual Interactions 


Research into multimodal interaction between audition and other modalities has 
primarily focused on the interaction between audition and vision. This choice is 
naturally due to the fact that audition and vision are the most dominant modalities 
in the human perceptual system [29]. A well-known multimodal phenomenon is the 
McGurk effect [38]. The McGurk effect is an example of how vision alters speech 
perception; for instance, the sound “ba” is perceived as “da” when viewed with the 
lip movements for “ga”. Notice that in this case, the percept is different from both the 
visual and auditory stimuli, so this is an example of intersensory bias, as described 
in the previous section. 

The different experiments described until now show a dominance of vision over 
audition, when conflicting cues are provided. However, this is not always the case. As 
an example, in [53, 54], a visual illusion induced by sound is described. When a single 
visual flash is accompanied by multiple auditory beeps, the single flash is perceived 
as multiple flashes. These results were obtained by flashing a uniform white disk for 
a variable number of times, 50 milliseconds apart, on a black background. Flashes 
were accompanied by a variable number of beeps, each spaced 57 milliseconds 
apart. Observers were asked to judge how many visual flashes were presented on 
each trial. The trials were randomized and each stimulus combination was run five 
times on eight naive observers. Surprisingly, observers consistently and incorrectly 
reported seeing multiple flashes whenever a single flash was accompanied by more 
than one beep [53]. This phenomenon is known as the sound-induced flash illusion. A 
follow-up experiment investigated whether the illusory flashes could be perceived 
independently at different spatial locations [26]. Two bars were displayed at two 
locations, creating an apparent motion. All subjects reported that an illusory bar was 
perceived with the second beep at a location between the real bars. This is analogous 
to the cutaneous rabbit perceptual illusion, where trains of successive cutaneous 
pulses delivered at a few widely separated locations produce sensations at many 
in-between points [19]. As a matter of fact, perception of time, wherein auditory 
estimates are typically more accurate, is dominated by hearing. 
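
The stimulus timing of the sound-induced flash illusion experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the ranges of flash and beep counts are made up (the text only says they were variable), and both trains are assumed to start at 0 ms because the relative offset between the first flash and the first beep is not given here.

```python
import itertools
import random


def onset_times_ms(count, interval_ms, start_ms=0.0):
    """Onset times (ms) for `count` events spaced `interval_ms` apart."""
    return [start_ms + i * interval_ms for i in range(count)]


def make_trial(n_flashes, n_beeps):
    """One trial: flashes 50 ms apart, beeps 57 ms apart, as in the text.

    Assumption: both trains start at 0 ms.
    """
    return {
        "n_flashes": n_flashes,
        "n_beeps": n_beeps,
        "flash_onsets_ms": onset_times_ms(n_flashes, 50.0),
        "beep_onsets_ms": onset_times_ms(n_beeps, 57.0),
    }


def build_session(flash_counts, beep_counts, repetitions=5, seed=0):
    """Every flash/beep combination, repeated and presented in random order."""
    combos = itertools.product(flash_counts, beep_counts)
    trials = [make_trial(f, b) for f, b in combos for _ in range(repetitions)]
    random.Random(seed).shuffle(trials)  # randomized trial order
    return trials


# Hypothetical ranges of flash and beep counts
session = build_session(flash_counts=(1, 2), beep_counts=(0, 1, 2))

# The critical condition: a single flash accompanied by two beeps
illusion_trial = make_trial(n_flashes=1, n_beeps=2)
```

The `repetitions=5` default mirrors the five runs per stimulus combination mentioned in the text; everything else is a placeholder to be replaced by the actual experimental design.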

Another experiment explored whether two objects appear to bounce off each other 
or simply cross if observers hear a beep when the objects could be in contact. In this 
particular case, a desktop computer displayed two identical objects moving towards 
each other. The display was ambiguous, allowing two different interpretations 
after the objects met: They could either bounce off each other or cross. Since collisions 
usually produce a characteristic impact sound, introducing such a sound when the 
objects met promoted the perception of bouncing over crossing. This phenomenon 
is usually known as the motion-bounce illusion [51]. In a subsequent study, Sekuler and 
Sekuler found that any transient event temporally aligned with the would-be collision 
increased the likelihood of a bounce percept [50]. This includes a pause, a flash 
of light on the screen, or a sudden disappearance of the discs. Auditory dominance 
has also been found in other examples with respect to time-based abilities such as 
precise temporal processing [47], temporal localization [5], and estimation of time 
durations [43]. Lipscomb and Kendall [34] provide another example of auditory 
dominance in a multimedia context (film). These researchers found that variation in 
participant semantic differential ratings was influenced more by the musical component 
than by the visual element. Particularly interesting in its implications for 
processing multisensory experiences is the study in [22], which points to the disappearance of visual 
dominance when a visual signal is presented simultaneously with an auditory and 
haptic signal (i.e., as a tri-sensory combination). The authors concluded that while 
vision can dominate both the auditory and the haptic sensory modalities, this is lim- 
ited to bi-sensory combinations in which the visual signal is combined with another 
single stimulus. 

More recent investigations examined the role of ecological auditory feedback 
in affecting multimodal perception of visual content. As an example, in a study 
presented in [15], the combined perceptual effect of visual and auditory information 
on the perception of a moving object’s trajectory was investigated. Inspired by the 
experimental paradigm presented in [27], the visual stimuli consisted of a perspective 
rendering of a ball moving in a three-dimensional box. Each video was paired with 
one of three sound conditions: Silence, the sound of a ball rolling, or the sound of 
a ball hitting the ground. It was found that the sound condition influenced whether 
observers were more likely to perceive the ball as rolling back in depth on the floor 
of the box or jumping in the frontal plane. 

Another interesting study related to the role of auditory cues in the perception 
of visual stimuli is the one presented in [60]. Two psychophysical studies were 
conducted to test whether visual sensitivity to point-light depictions of human gait 
reflects the action specific co-occurrence of visual and auditory cues typically pro- 
duced by walking people. To perform the experiment, visual walking patterns were 
captured using a motion capture system, and a between-subject experimental proce- 
dure was adopted. Specifically, subjects were randomly exposed to one of the three 
experimental conditions: No sound, footstep sounds, or a pure tone at 1000 Hz, which 
represented a control case. Visual sensitivity to coherent human gait was measured 
by asking subjects if they could detect a person walking or not. Such sensitivity was 
greatest in the presence of temporally coincident and action-consistent sounds, in this 
case, the sound of footsteps. Visual sensitivity to human gait with coincident sounds 
that were not action-consistent, in this case the pure tone, was significantly lower and 
did not significantly differ from visual sensitivity to gaits presented without sound. 

As an additional interaction between audition and vision, sound can help the 
user search for an object within a cluttered, continuously changing environment. It 
has been shown that a simple auditory pip drastically decreases search times for 
a synchronized visual object that is normally very difficult to find. This is known 
as the pip and pop effect [62]. Visual feedback can also affect several aspects of a 
musical performance, although in this chapter affective and emotional aspects of a 
musical performance are not considered. As an example, Schutz and Lipscomb report 
an audio-visual illusion in which an expert musician’s gestures affect the perceived 
duration of a note without changing its acoustic length [49]. To demonstrate this, they 
recorded a world-renowned marimba player performing single notes on a marimba 
using long and short gestures. They paired both types of sounds with both types of 
gestures, resulting in a combination of natural (i.e., congruent gesture-note pairs) and 
hybrid (i.e., incongruent gesture-note pairs) stimuli. They informed participants that 
some auditory and visual components had been mismatched, and asked them to judge 
tone duration based on the auditory component alone. Despite these instructions, the 
participants’ duration ratings were strongly influenced by visual gesture information. 
As a matter of fact, notes were rated as longer when paired with long gestures than 
when paired with short gestures. These results are somewhat puzzling, since they 
contradict the view that judgments of tone duration are relatively immune from visual 
influence [64], that is, in temporal tasks visual influence on audition is negligible. 
However, the results are not based on information quality, but rather on perceived 
causality, given that visual influence in this paradigm is dependent on the presence 
of an ecologically plausible audiovisual relationship. 

Indeed, it is also possible to consider the characteristics of vision and audition to 
predict which modality will prevail when conflicting information is provided. In this 
direction, [31] introduced the notion of auditory and visual objects. They describe 
the different characteristics of audition and vision, claiming that a primary source of 
information for vision is a surface, while a secondary source of information is the 
location and colour of sources. On the other hand, a primary source of information 
for audition is a source and a secondary source of information is a surface. 

In [16], a theory is suggested on how our brain merges the different sources 
of information coming from the different modalities, specifically audition, vision, 
and touch. Two strategies are distinguished. The first is called sensory combination, 
which means the maximization of information delivered from the different sensory modalities. The second 
strategy is called sensory integration, which means the reduction of variance in the 
sensory estimate to increase its reliability. Sensory combination describes interactions 
between sensory signals that are not redundant. By contrast, sensory integration 
describes interactions between redundant signals. Ernst and coworkers [16] describe 
the integration of sensory information as a bottom-up process. 

The modality precision hypothesis, also called the modality appropriateness hypothesis [64], 
is often cited when trying to explain which modality dominates under what circum- 
stances. This hypothesis states that discrepancies are always resolved in favour of 
the more precise or more appropriate modality. In spatial tasks, for example, the 
visual modality usually dominates, because it is the most precise at determining spa- 
tial information. However, according to [16], this terminology is misleading because 
it is not the modality itself or the stimulus that dominates. Rather, the dominance 
is determined by the estimate and how reliably it can be derived within a specific 
modality from a given stimulus. 
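
The idea that dominance follows the reliability of each estimate can be illustrated with inverse-variance weighting, a standard formulation in the cue-combination literature. The sketch below is not taken from [16]; the function name and all numerical values are hypothetical and serve only to show how a more reliable (lower-variance) estimate receives more weight and how the combined estimate has lower variance than either input.

```python
def integrate_estimates(estimates, variances):
    """Combine redundant estimates by inverse-variance weighting.

    Each estimate is weighted by its reliability (1 / variance),
    normalized so that the weights sum to one.
    """
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_variance = 1.0 / total  # never larger than the smallest input variance
    return combined, combined_variance, weights


# Hypothetical location estimates in degrees of azimuth:
# a precise visual estimate and a less precise auditory one
visual_estimate, visual_variance = 0.0, 1.0
auditory_estimate, auditory_variance = 10.0, 9.0

location, variance, weights = integrate_estimates(
    [visual_estimate, auditory_estimate],
    [visual_variance, auditory_variance],
)
```

With these made-up numbers the combined location lies much closer to the visual estimate, which is consistent with the ventriloquism effect. Swapping the variances would instead pull the result towards the auditory estimate, in line with the point that it is the reliability of the estimate, not the modality as such, that determines dominance.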

A major design dilemma involves the extent to which audio interfaces should 
maintain the conventions of visual interfaces [40]. Indeed, most attempts at auditory 
display seek to emulate or translate elements of visual interfaces to the auditory 
modality. While retrofitting visual interfaces with sound can offer some consisten- 
cies across modalities, the constraints of this approach may hinder the design of 
auditory interfaces. While visual objects exist primarily in space, auditory stimuli 
occur in time. A more appropriate approach to auditory interface design, therefore, 
may require designers to focus more strictly on auditory capabilities. Such interfaces 
may present the items and objects of the interface in a fast, linear fashion over time 
rather than attempting to provide auditory versions of the spatial relationships found 
in visual interfaces. 


10.3 Embodied Interactions 


The experiments described until now assume a passive observer, in the sense that a 
subject is exposed to a fixed sequence of audiovisual stimuli and is asked to report 
on the resulting perceptual experience. When a subject is interacting with the stimuli 
provided, a tight sensorimotor coupling is enabled, which is an important characteristic 
of embodied perception. According to embodiment theory, a person and the 
environment form a pair in which the two parts are coupled and determine each 
other. The term embodied highlights two points: First, cognition depends upon the 
kinds of experience that are generated from specific sensorimotor capacities. Second, 
these individual sensorimotor capacities are themselves embedded in a biological, 
psychological, and cultural context [14]. 

The notion of embodied interaction is based on the view that meanings are present 
in the actions that people engage in while interacting with objects, with other peo- 
ple, and with the environment in general. Embodied interfaces try to exploit the 
phenomenological attitude of looking at the direct experience, and let the meanings 
and structures emerge as experienced phenomena. Embodiment is not a property 
of artefacts but rather a property of how actions are performed with or through the 
artefacts. 

The central role of our body in perception, cognition and interaction, has been 
previously addressed by philosophers (e.g., [39]), psychologists (e.g., [41]) and neu- 
roscientists (e.g., [10]). A rather recent approach to the understanding of the design 
process, especially in its early stages, has been to focus on the role of multimodality 
and the contribution of non-verbal channels as key means of communication, kinaes- 
thetic thinking, and more generally of doing design [59]. Audio-haptic interactions, 
described in the following section, also require a continuous action-feedback loop 
between a person and the environment, an important characteristic of embodied per- 
ception. Another approach, called embodied sound design, proposes to place the 
bodily experience (i.e., communication of sonic concepts through vocal and gestural 
imitations) at the centre of the sound creation process [12]. 

The role of the body in HCI has recently gained more attention overall, and 
interested readers can refer to the book by Höök [23] and to Chap. 7 in this volume. 


10.4 Audio-Haptic Interactions 


Although the investigation of audio-haptic interactions has not received as much 
attention as audiovisual interactions, it is certainly an interesting field of research, 
especially considering the tight connections existing between the sense of touch and 
audition. As a matter of fact, both audition and touch are sensitive to the very same 
kind of physical property, that is, mechanical pressure in the form of oscillations. 
The tight correlation between the information content (oscillatory patterns) being 
conveyed in the two senses can potentially support interactions of an integrative 
nature at a variety of levels along the sensory pathways. Auditory cues are normally 
elicited when one touches everyday objects, and these sounds often convey useful 
information regarding the nature of the objects [18]. The feeling of skin dryness or 
moistness that arises when we rub our hands against each other is subjectively referred 
to the friction forces at the epidermis. Yet, it has been demonstrated that acoustic 
information also participates in this bodily sensation, because altering the sound 
arising from the hand rubbing action changes our sensation of dryness or moistness 
at the skin. This phenomenon is known as the parchment-skin illusion [25]. 

The parchment-skin illusion is an example of how interactive auditory feedback 
can affect subjects’ haptic sensation. Specifically, in the experiment demonstrating 
the parchment-skin illusion, subjects were asked to sit with a microphone close to their 
hands, and then to rub their hands against each other. The sound of the hands rubbing was 
captured by the microphone, manipulated in real time, and played back 
through headphones. The sound was modified by attenuating the overall amplitude 
and by amplifying the high frequencies. Subjects were asked to rate the haptic sensation 
in their palms as a function of the different auditory cues provided, on a scale 
ranging from very moist to very dry. Results show that the provided auditory feedback 
significantly affected the perception of the skin’s dryness. This study was extended 
in [20], by using a more rigorous psychophysical testing procedure. Results reported 
a similar increase in smooth-dry scale correlated to changes in auditory feedback, 
but not in the roughness judgments per se. However, both studies provide convincing 
empirical evidence demonstrating the modulatory effect of auditory cues on people’s 
haptic perception of a variety of different surfaces. A similar experiment was per- 
formed combining auditory cues with haptic cues at the tongue. Specifically, subjects 
were asked to chew on potato chips, and the sound produced was again captured and 
manipulated in real time. Results show that the perception of potato chips’ crisp- 
ness was affected by the auditory feedback provided [56]. A surprising audio-haptic 
bodily illusion demonstrating that human observers rapidly update their assumptions 
about the material qualities of their body is the marble hand illusion [52]. By repeatedly 
and gently hitting a participant’s hand while progressively replacing the natural sound 
of the hammer against the skin with the sound of a hammer hitting a piece of marble, 
it was possible to induce an illusory misperception of the material properties of the 
hand. After 5 min, the hand started feeling stiffer, heavier, harder, less sensitive, 
and unnatural, and showed enhanced galvanic skin response to threatening stimuli. 
This bodily illusion demonstrates that the experience of the material of our body can 
be quickly updated through multisensory integration. Another interesting example 
where sounds again affect body perception is shown in [58]. Here, the illusion is 
applied to footstep sounds. By digitally varying sounds produced by walking, it is 
possible to vary one’s perception of weight. 
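
The kind of sound manipulation used in the parchment-skin experiment described earlier in this section (changing the overall amplitude and amplifying the high frequencies) can be sketched with a simple band split. This is an offline illustration only, not the apparatus used in [25]: the function name, the one-pole filter, the cutoff frequency, and the gain values are all assumptions chosen for brevity.

```python
import math


def emphasize_highs(samples, sample_rate, cutoff_hz=2000.0,
                    high_gain=2.0, overall_gain=1.0):
    """Scale the high-frequency band and the overall level of a signal.

    The signal is split into a low band (one-pole low-pass filter) and a
    high band (the remainder). The high band is scaled by `high_gain`
    and the sum is scaled by `overall_gain`. All defaults are hypothetical.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    low = 0.0
    output = []
    for x in samples:
        low += alpha * (x - low)   # low-pass state update
        high = x - low             # what the low-pass removed
        output.append(overall_gain * (low + high_gain * high))
    return output


# A toy input: the fastest possible oscillation at this sample rate
sample_rate = 44100
test_signal = [1.0 if n % 2 == 0 else -1.0 for n in range(200)]

# Boost the high frequencies while leaving the overall level unchanged
boosted = emphasize_highs(test_signal, sample_rate)

# Attenuate the overall amplitude without changing the spectral balance
attenuated = emphasize_highs(test_signal, sample_rate,
                             high_gain=1.0, overall_gain=0.5)
```

A real-time version would process audio in short blocks from a microphone and keep the filter state between blocks; that part depends on the audio interface in use and is left out here.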

Lately, artificial cues have been appearing in audio-haptic interfaces, allowing us to carefully 
control the variations to the provided feedback and the resulting perceived 
effects on exposed subjects [13, 42, 61]. Artificial auditory cues have also been used 
in the context of sensory substitution, for artificial sensibility at the hands using 
hearing as a replacement for loss of sensation [35]. In this particular study, micro- 
phones placed at the fingertips captured and amplified the friction sound obtained 
when rubbing hard surfaces. 

In [28], an investigation of the interaction between auditory and haptic cues 
in the near space is presented. The authors show an interesting illusion in which 
sounds delivered through headphones and presented near the head induce a haptic 
experience. The left ear of a dummy head was stroked with a paintbrush and the 
sound was recorded. The sound was then presented to the participants who felt a 
tickling sensation when the sound was presented near to the head, but not when it 
was presented distant from the head. 

Another kind of dynamic sonic objecthood is that obtained through data physical- 
ization, which is the 3D rendering of a dataset in the form of a solid physical object. 
Although there is a long history of physicalization, this area of research has become 
increasingly interesting through the facilitation of 3D printing technology. Physical- 
izations allow the user to hold and manipulate a dataset in their hands, providing an 
embodied experience that allows rich naturalistic and intuitive interactions such as 
multi-finger touch, tapping, pressing, squeezing, scraping, and rotating [36]. 

Physical manipulation produces acoustic effects that are influenced by the material 
properties, shape, forces, modes of interaction and events over time. The idea that 
sound could be a way to augment data physicalization has been explored through 
acoustic sonifications in which the 3D printed dataset is super-imposed on the form 
of a sounding object, such as a bell or a singing bowl [1]. Since acoustic vibrations 
are strongly influenced by 3D form, the sound that is produced is influenced by 
the dataset that is used to shape the sounding object. In a similar vein, the design
of musical instruments has also inspired the design of new interfaces for human- 
computer interaction. As stated by Jaron Lanier, musical instruments are the best 
user interfaces (see [1]), and we can learn to design new interfaces by looking at 
musical instruments. An example is the work of [32], where structural elements 
along the speaker-microphone pathway characteristically alter the acoustic output. 
Moreover, Chap. 12 proposes several case studies in the context of musical haptics. 

In designing multimodal environments, several elements need to be taken into con- 
sideration. However, technology imposes some limitations, especially when the ulti- 
mate goal is to simulate systems that react in realtime. This issue is nicely addressed 
by Pai, who describes a tradeoff between accuracy and responsiveness, a crucial 
difference between models for science and models for interaction (see [44]). Specifi- 
cally, computations about the physical world are always approximations. In general, 
it is possible to improve accuracy by constructing more detailed models and
performing more precise measurements, but this increased accuracy comes at the cost
of latency, i.e., the elapsed time before an answer is obtained. For multisensory
models, it is also essential to ensure synchronization of time between different
sensory modalities. Pai [44] groups all of these temporal considerations, such as latency
and synchronization, into a single category called responsiveness. The question then
becomes how to balance accuracy and responsiveness. The choice between accuracy 
and responsiveness depends also on the final goal of the multimodal system design. 
Often, scientists are more concerned with accuracy, so responsiveness is only a soft 
constraint based on available resources. On the other hand, for interaction designers, 
responsiveness is an essential parameter that must be satisfied. 
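
The tradeoff described above can be illustrated with a minimal Python sketch. It is
not taken from [44]; the model names, error values, latencies, and the latency budget
are all hypothetical and serve only to show responsiveness treated as a hard
constraint (interaction) versus a soft one (science).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    """A candidate simulation model (all values are hypothetical)."""
    name: str
    error: float       # approximation error, lower means more accurate
    latency_ms: float  # elapsed time before an answer is obtained


def pick_for_interaction(options, latency_budget_ms):
    """Responsiveness is a hard constraint: return the most accurate
    model among those that answer within the latency budget."""
    feasible = [o for o in options if o.latency_ms <= latency_budget_ms]
    if not feasible:
        return None  # no model is responsive enough
    return min(feasible, key=lambda o: o.error)


def pick_for_science(options):
    """Accuracy comes first; latency only breaks ties (soft constraint)."""
    return min(options, key=lambda o: (o.error, o.latency_ms))


# Hypothetical candidates, ordered from coarse to fine.
OPTIONS = [
    ModelOption("coarse model", error=0.20, latency_ms=2.0),
    ModelOption("medium model", error=0.08, latency_ms=8.0),
    ModelOption("fine model", error=0.01, latency_ms=400.0),
]

if __name__ == "__main__":
    print(pick_for_interaction(OPTIONS, latency_budget_ms=10.0).name)
    print(pick_for_science(OPTIONS).name)
```

With these assumed numbers, the interaction designer ends up with the medium model,
while the scientist accepts the slow but most accurate one.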


10.5 Conclusions 


This chapter has provided an overview of several experiments whose goals were to 
achieve a better understanding of how the human auditory system is connected to 
visual and haptic channels. A better understanding of multimodal perception can 
have several applications. As an example, systems based on sensory substitution 
help people lacking a certain sensorial modality by replacing it with another sen- 
sorial modality. Moreover, cross-modal enhancement allows reduced stimuli in one 
sensorial modality to be augmented by a stronger stimulation in another modality. 

Contemporary advances in hardware and software technology allow us to experiment
in several ways with technologies for multimodal interaction design, building,
for example, haptic illusions with equipment available in a typical hardware store [21]
or easily experimenting with sketching and rapid prototyping [6, 11]. These advances
in technology create several possibilities for discovering novel cross-modal illusions 
and interactions between the senses, especially when a collaboration between cogni- 
tive psychologists and interaction designers is facilitated. A research challenge is not 
only to understand how humans process information coming from different senses, 
but also how information in a multimodal system should be distributed to different 
modalities in order to obtain the best user experience. 

As an example, in a multimodal system in which the user controls a haptic
display, sees a visual display, and listens to an interactive auditory display, it is
important to determine which synchronicities are more important. At one extreme, a 
completely disjointed distribution of information over several modalities can offer the 
highest bandwidth, but the user may be confused in connecting the modalities and one 
modality might mask another and distract the user by focusing attention on events that 
might not be important. At the other extreme, a completely redundant distribution of 
information is known to increase the cognitive load and is not guaranteed to increase 
user performance. 

Beyond the research on multimodal stimuli processing, studies are needed on 
the processing of multimodal stimuli that are connected via interaction. We would 
expect that the human brain and sensory system have been optimized to cope with 
a certain mixture of redundant information, and that information displays are better
the more they follow this natural distribution. Overall, the more we achieve a better
understanding of the ways humans interact with the everyday world, the more we 
can obtain inspiration for the design of effective natural multimodal interfaces.


Chapter 11
Immersion in Audiovisual Experiences


Sarvesh R. Agrawal and Søren Bech 


Abstract Understanding the concept of immersion and its influencing factors is
critical for enabling engaging audiovisual experiences. However, a lack of definitional
consensus and of suitable methods for assessing immersion hinders research on
the subject. This chapter discusses the idea of immersion based on a non-exhaustive
literature review of the topic and presents an adaptable definition of immersion that
is not limited to virtual reality applications. Additionally, an exploratory experimental
paradigm for measuring immersion in audiovisual experiences is described. The
description of immersion and the experimental framework presented in this chapter
are starting points for resolving the difference in opinion and for developing novel
methods to thoroughly explore the concept of immersion, respectively.


11.1 Introduction 


Audiovisual technology has advanced drastically over the last decade. Spatial audio in 
conjunction with advanced visual technologies such as enhanced color reproduction 
and greater dynamic range is witnessing wide-scale adoption for domestic audiovisual
applications (e.g., gaming, entertainment, broadcast). In addition to the technological
progress, the emergence of virtual reality (VR) and augmented reality (AR) is swiftly
changing the paradigm for domestic audiovisual experiences.

The vocabulary for describing new audiovisual experiences unlocked by these
technologies has evolved as well. Immersion has emerged as the predominant term 
for describing audiovisual experiences. Nevertheless, the concept of immersion is 
poorly understood. Immersion is studied in a variety of fields such as film
[52, 55, 68], video games [1, 8, 31, 53, 58], virtual reality [24, 41, 46, 51], and
music [4, 14]. It is used to describe a large array of experiences, which contributes to
the ambiguity surrounding the term. Immersion is often considered synonymous with
presence and envelopment, which further dilutes the concept. A lack of definitional
consensus and the interchangeable use with terms such as presence have reduced 
immersion to an “excessively vague, all inclusive concept” [39]. A formal definition 
of immersion is a prerequisite for communicating the idea effectively and conducting 
research on the topic. Thus, the first half of this chapter attempts to formalize the
meaning of immersion by proposing a definition that has been synthesized from a
non-exhaustive literature review of the subject. A wide perspective has been adopted for
the proposed definition such that it can be easily adapted for different applications 
as well as interactive and non-interactive activities. 

As technologists, we are interested in enabling experiences with a greater degree 
of immersion on the premise that more immersive experiences are preferable. This 
can be achieved by developing a deeper understanding of the various factors that influ- 
ence immersion and subsequently harnessing their capabilities for delivering more 
immersive experiences. However, the fundamental challenge in investigating immer- 
sion is a lack of methodologies for measuring immersion. To this end, an exploratory 
study was conducted for quantifying immersion in audiovisual experiences as a first 
step. The experimental framework detailed in the latter half of this chapter can form 
the basis for developing experimental paradigms aimed at investigating the impact 
of immersion’s influencing factors. 


11.2 Conceptualizations of Immersion 


Immersion is a complex subject that can have a different meaning depending on the 
context and the field of study. While the origin of immersion’s conceptualization 
is unknown, it is agreed that it is a metaphorical term derived from the physical 
experience of being surrounded by a completely different medium. Murray [43] has 
provided the following description of immersion: 


Immersion is a metaphorical term derived from the physical experience of 
being submerged in water. We seek the same feeling from a psychologically 
immersive experience that we do from a plunge in the ocean or swimming pool: 
the sensation of being surrounded by a completely other reality, as different 
as water is from air, that takes over all of our attention, our whole perceptual 
apparatus ([43], p. 99).


Fig. 11.1 Structure of the proposed literature review. Adapted from [3]. The figure
contrasts two perspectives on immersion: the psychological perspective (subjective
experience of being mentally involved in an activity) and the physical perspective
(property of the technology/system that facilitates the experience; an objective
property that can be quantified based on the technical specifications). It lists three
reasons that can lead to psychological immersion: (1) the sense of being surrounded
and/or experiencing multisensory stimulation (sense of being enveloped by the
environment, often achieved by blocking the external world and/or delivering
overpowering sensory information); (2) absorption in the narrative or its depiction
(mental absorption in the story, classified as spatial, temporal, and emotional
immersion); and (3) absorption when facing strategic and/or tactical challenges (mental
absorption when planning, strategizing, etc. or responding with swift tactile movements).


The analogy of “experience of swimming underwater” has been open to interpre- 
tation as some researchers have approached the topic from a physical perspective 
(i.e., being surrounded by a different reality) while others view it from a psycholog- 
ical viewpoint (i.e., similar to the metaphorical derivation described by Murray [43] 
where attention is a factor). The descriptions of immersion appearing in literature can 
be largely classified into two perspectives: immersion as a psychological experience 
and immersion as an objective property of the system or the technology that facili- 
tates the experience. A brief introduction to these perspectives and a visual summary 
of the literature review in this chapter is provided by Fig. 11.1. 


11.2.1 Psychological Perspective 


The psychological perspective on immersion states that immersion is the psycho- 
logical state of the individual when they are mentally involved in an activity [37]. 
It argues that attention is at the heart of immersion and de-emphasizes the role of 
the system or the technology that mediates the experience.1 Instead, significance is
placed on the narrative and its presentation along with the individual participating in
the experience. The idea of psychological immersion can be illustrated through the
example of reading books. Books provide limited sensory stimulation to the reader in
comparison to multisensory audiovisual experiences; nevertheless, the narrative
content presented by books and its relevance to the reader can lead to a psychologically
immersive experience.

1 This is different from the concept of presence, which is heavily influenced by the
capabilities of the system/technology. Presence is the illusion of being in an environment
other than the physical environment around the user in mediated experiences. Please refer
to Sect. 11.4.1 for a brief discussion on the distinction between presence and immersion.

The three recognized reasons that can lead to psychological immersion are the 
sense of being surrounded, absorption in the narrative or its depiction, and absorption 
when facing challenges. While these are often viewed as different types/dimensions 
of immersion, we believe that conclusive evidence is required to determine if the 
experiences they lead to are fundamentally different to warrant the classification 
of psychological immersion. An overview of the three reasons is presented in the 
following subsections. 


11.2.1.1 Sense of Being Surrounded or Experiencing Multisensory 
Stimulation 


Immersion2 is often viewed as a perceptual experience that is directly dependent on
the capabilities of the rendering system. The sense of being surrounded or experienc- 
ing multisensory stimulation is a prevalent conceptualization of immersion. Biocca 
and Delaney [7] dubbed this perceptual immersion: the extent of submersion of the 
user’s perceptual system in the environment. It is believed that perceptual immersion 
can be measured objectively by “counting the number of the user’s senses that are 
provided with input and the degree to which inputs from the physical environment 
are shut out” [36]. McMahan [39] stated that perceptual immersion can be achieved 
by blocking the external world and constraining the user’s perception to the presented 
stimulus. 
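
The counting idea quoted above can be made concrete with a small Python sketch. It is
only one possible reading of the quotation, not a method proposed in [7], [36], or
[39]: the list of five senses, the function name, and the use of a mean "shut out"
fraction are all assumptions made for illustration.

```python
# Hypothetical list of senses used for the count (an assumption of this sketch).
SENSES = ("vision", "hearing", "touch", "smell", "taste")


def perceptual_immersion_summary(stimulated, shut_out):
    """Return (number of senses given input by the system,
    mean fraction of physical-environment input that is shut out).

    stimulated: set of sense names that the system provides with input.
    shut_out:   dict mapping a sense name to a fraction in [0, 1];
                senses that are not listed count as 0 (nothing blocked).
    """
    unknown = (set(stimulated) | set(shut_out)) - set(SENSES)
    if unknown:
        raise ValueError(f"unknown senses: {sorted(unknown)}")
    for sense, fraction in shut_out.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {sense} must lie in [0, 1]")
    n_stimulated = len(set(stimulated))
    mean_shut_out = sum(shut_out.get(s, 0.0) for s in SENSES) / len(SENSES)
    return n_stimulated, mean_shut_out


if __name__ == "__main__":
    # Hypothetical headset: vision fully occluded, hearing partly isolated.
    print(perceptual_immersion_summary(
        {"vision", "hearing"}, {"vision": 1.0, "hearing": 0.5}))
```

Such a summary describes only the rendering setup; it does not capture whether the
user actually attends to the stimulus.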

The role of sensory information in immersive gaming experiences was recog- 
nized by Ermi and Mäyrä [16] for the development of a gameplay experience model 
(sensory, challenge-based, and imaginative immersion model or SCI model). The 
authors called it sensory immersion: an overpowering of the sensory information 
from the real environment through large screens and powerful sounds to focus the 
user entirely on the stimulus. In their study on presence, Witmer and Singer [70] 
made the distinction between immersion and involvement such that the former is the 
subjective experience of being enveloped in an interactive environment and the latter 
is a psychological state which results from directing attention to the stimulus. 

It may appear that what many researchers call perceptual or sensory immersion is
a completely different perspective on immersion compared to psychological immersion.
Nevertheless, it is instead a facilitator for psychological immersion: overpowering
sensory information or blocking the stimuli from the immediate environment
does not guarantee psychological immersion but can prevent “an exogenous
shift of attention” [45] away from the activity, thereby facilitating psychological
immersion. The current attempts to create supposedly immersive audiovisual
experiences are based on this idea of eliciting psychological immersion. It is assumed
that augmenting the sensory information (e.g., in spatial audio reproduction) and/or
attempting to reduce the inputs from the physical environment (e.g., virtual reality)
will lead to the users focusing on the stimulus and experiencing psychological
immersion.

2 This section was originally published in [3].


11.2.1.2 Absorption in the Narrative or its Depiction 


The role of the narrative is considered to be an important dimension of the immersive 
experience. Mental absorption in the story or the mediated world is the definition 
of being immersed on a diegetic level. Adams and Rollings [1] called it narrative
immersion: “the feeling of being inside a story, completely involved and accepting 
the world and events of the story as real.” A similar description has been provided by 
Thon [45]: “narrative immersion refers to the player’s shift of attention to the unfold- 
ing of the story of the game and the characters therein as well as to the construction of 
a situation model representing not only the various characters and narrative events, 
but also the fictional game world as a whole.” The idea of narrative immersion has 
been echoed in the context of video games under imaginative immersion [16] and as 
fictional immersion [5] for all narrative forms. 

It has been suggested that an exciting story and interesting characters are
prerequisites for experiencing narrative immersion [1]. Ryan [57] classified the causes that
lead to narrative immersion as temporal, spatial, and emotional immersion. Temporal
immersion is experienced when one is curious to know how the story unfolds.
Spatial immersion refers to the experience of having a sense of space and enjoyment 
in exploration. Lastly, emotional immersion occurs when one is emotionally invested 
in the story and/or emotionally attached to the characters. It can also be observed 
when the narrative elements remind the individual of emotionally relevant instances 
or characters. 


11.2.1.3 Absorption When Facing Challenges 


The idea of being absorbed when facing challenges stems from the work conducted 
on immersion in gaming experiences. Absorption in the activity due to challenges 
occurs when a balance is achieved between ability and the perceived challenge [16]. 
These challenges can be mental challenges or sensorimotor challenges. Ermi and 
Mäyrä [16] believed that the challenges encountered will often be a combination 
of mental and sensorimotor challenges to a certain extent. Thus, the individuals 
must have attentional surplus to face the challenges simultaneously or the overlap 
between the challenges must be brief to avoid attentional overload [44]. The nature 
of the challenge was used by Adams and Rollings [1] to divide challenge-based immersion
into strategic and tactical immersion. Strategic immersion is experienced
when one is preoccupied strategizing and making choices mentally to conquer the 
task on hand. Tactical immersion refers to the state of mental absorption when one 
is fully concentrated on the activity that has a stream of demands for swift tactile 
movements (e.g., when playing action-packed video games). 

Challenge in the current view refers to active hurdles encountered in participatory
activities. Arsenault [5] argued that challenges are not required to experience
immersion and suggested replacing challenge-based immersion with systematic
immersion: immersion in the activity where one accepts the mechanics (e.g., rules,
physical movement, etc.) of the mediated experience instead of the mechanics from
the unmediated reality. The idea of systematic immersion can be applied to
non-participatory activities3 such as a screening of a fictional film where one readily
accepts the existence of magic and flying sea mammals, for instance.


11.2.2 Physical Perspective 


A substantial portion of the work on immersion has been performed in the context 
of media consumption for interactive applications (e.g., video games, virtual envi- 
ronments). This has supposedly led to the notion of immersion being an objective 
property of the system or the technology that facilitates the experience. In Slater 
and Wilbur’s [60] words, “Immersion is a description of a technology, and describes 
the extent to which the computer displays are capable of delivering an inclusive, 
extensive, surrounding and vivid illusion of reality to the senses of a human partic- 
ipant.” In this regard, immersion is seen as the capability of the system/technology 
to support the different modalities, deliver sensory information, and provide interac- 
tion capabilities. Slater rejects the idea of immersion being a subjective experience. 
Instead, he views immersion as an objective property of the system that consists of 
reproduction fidelity of the different modalities, isolation from the physical world, 
and behavioral fidelity among others [62]. According to him, these properties of the
“immersive system” can lead to the subjective experiences of place illusion and
plausibility illusion [62].4

It is important to state that approaching immersion as an objective property fails to
consider the perceptual limits, context, and individual factors such as mood,
preference, expertise, and expectations. It has been established that an improvement in the
technical specifications of the system does not necessarily lead to a proportional
perceptual change (as is evident from non-linear psychophysical curves). Limiting immersion
to the physical domain removes the sensory and cognitive filters that play an active
part in determining the overall experience. It has been appropriately suggested that
the term system immersion [61] should be used when referring to this perspective on
immersion.

3 Non-participatory activities are activities where the user’s actions cannot modify the
outcome of the activity. Reading a book or watching a movie are examples of
non-participatory activities. In contrast, playing a video game where the user’s inputs can
have an impact on the storyline they experience (e.g., Grand Theft Auto 5) is an example
of a participatory activity.

4 Slater’s description of place illusion is synonymous with the idea of physical presence
as stated by him in [63].


Fig. 11.2 Filter model and the suggested terms for referring to the perspectives on
immersion in the different domains

The ideas of immersion being an objective property and perceptual or sensory 
immersion are closely related. An improvement in the technical specifications of the 
system such as an increase in the number of loudspeakers can increase the sensory 
information leading to psychological immersion as explained in Sect. 11.2.1.1. This 
can give the impression that it is the system or the sensory information that leads to 
psychological immersion. While the system is a factor that can influence immersion, 
it is not the only factor as the physical perspective suggests. 


11.3 Immersion: A Cognitive Concept 


It is clear from the preceding section that we must organize the usage of the term 
immersion to communicate the intended ideas and conduct research on the subject. 
We use the filter model [6, 33] to differentiate and categorize the ideas conveyed 
by the common term. The model (depicted in Fig. 11.2) has been used for sensory 
analysis in food science, sound and image quality evaluations, and to study the spatial 
characteristic of sound among others [6]. 

The model starts with the physical domain which houses a physical stimulus
(e.g., a music signal played by a loudspeaker). The stimulus is characterized by
the physical measurements of the audio frequency content, spatial audio channels,
etc. The stimulus is perceived after passing through the sensory filter when it is
transformed by the sensory system (e.g., auditory system) to neural energy. The
result is an auditory event which comprises attributes of sound (e.g., loudness,
envelopment). The elicitation of the attributes and their strength depends upon the
characteristics of the physical stimulus and the sensory system. The auditory event
can be evaluated by perceptual measurements in the perceptual domain. Finally, to
form an overall impression of the auditory event, the perception passes through the
cognitive filter which accounts for emotional state, expertise, expectations, mood,
context, etc. The cognitive factors and the individual attributes from the perceptual
domain contribute to the overall impression which requires an integrative frame of
mind. These affective or hedonic measurements include assessment of quality, degree
of liking, annoyance, acceptance, etc.

5 Part of this section has been copied from [3] upon receiving the publisher’s approval.

The filter model is simple yet powerful as it allows us to evaluate the influence 
of physical parameters of the system and signal on affective measurements by link- 
ing the two domains through the perceptual domain. The primary ideas conveyed 
by the use of the term immersion can be categorized using the filter model. First, 
we have Slater’s idea of immersion as being an objective property of the technol- 
ogy/system. Slater [62] has stated that “Let’s reserve the term ‘immersion’ to stand 
simply for what the technology delivers from an objective point of view. The more 
that a system delivers displays (in all sensory modalities) and tracking that preserves 
fidelity in relation to their equivalent real-world sensory modalities, the more that it 
is ‘immersive’.” Slater [61] has suggested the term system immersion to denote his 
understanding of immersion which is in the physical domain. Second, immersion is 
used to refer to the sense of being surrounded by a stimulus (see Sect. 11.2.1.1). This
is the perception of the stimulus and thus exists in the perceptual domain. We
recommend that the term perceptual immersion should be used when referring to the feeling
of being surrounded. Surrounding the user with a stimulus is often done in the hope
of eliciting psychological immersion, since both perceptual attributes and cognitive
factors contribute to affective measurements, as explained in the preceding
paragraph. Finally, the idea of psychological immersion (see Sect. 11.2.1) or
involvement/absorption in the activity can be explained in the cognitive domain. The user
(their personal characteristics) plays an important role in the experience of
psychological immersion, but the perceptual attributes (e.g., envelopment, naturalness) can
also influence psychological immersion.

Our motivation for studying immersion is to identify the influencing factors so 
that they may be tuned to enhance experiences. The role of the individual is of 
utmost importance since experiences are, by their very nature, subjective. Thus, it 
is important to consider the holistic experience instead of focusing on individual 
parts that contribute to the experience. Assessment of audiovisual experiences has 
been historically driven from a bottom-up approach beginning from the technical 
specifications of the system that facilitates the experience. However, improvements in 
the technical capabilities of the system may not always lead to a perceptual difference 
(e.g., when the improvements are smaller than the just noticeable difference or beyond 
the thresholds of the human sensory system), rendering them insignificant for the 
goal of improving experiences. Therefore, we advocate a top-down approach where
the idea is first studied holistically (in the cognitive domain) and then empirical
relationships are forged to the technical parameters of the system (physical domain).

We view immersion from a psychological perspective (similar to Sect. 11.2.1). For 
the remainder of this chapter, our usage of the term immersion refers to psychological 
immersion unless noted otherwise. Synthesizing from the descriptions of immersion 
appearing in literature, we propose the following definition of immersion that can be 
applied to a wide range of applications: 


Immersion is a phenomenon experienced by an individual when they are in a 
state of deep mental involvement in which their cognitive processes (with or 
without sensory stimulation) cause a shift in their attentional state such that 
one may experience disassociation from the awareness of the physical world. 


We consider immersion to be a normal occurrence of focused attention (on the 
activity) during waking consciousness. During immersion, the mind is absorbed in 
the current motivated activity and conscious attention is focused on the features of 
the situation that are related to the achievement of the intended goal. Still, during 
most normal circumstances the mind can easily be disturbed by extrinsic factors 
(e.g., noise in the environment), intrinsic dynamic tendencies (e.g., unfinished tasks 
or obligations), and random noise. Unlike hallucinations and dreaming during sleep 
states, the mind is still attentive or watchful (to some degree) to the occurrences in the 
world and monitors the present state of the body when immersed in a construction 
built by intrinsic factors. When something of significance for the maintenance of 
the subject’s life and well-being occurs, the resulting perturbation can usually
destabilize the current state rather easily, change the focus of attention, and propel the mind into
another and more stable attractor of orientation and search for the nature of the 
disturbance. For detailed discussions of consciousness, the reader is referred to [15, 
18, 20, 38, 50]. 

Involvement in the current view necessitates an interaction between the subject 
and the system not only in a physical sense (the completion of a series of actions and 
operations upon the system) but also in a psychological sense (the interaction between 
the subject’s motives for the interaction with the system and the system’s objective 
capabilities for the pursuit of the subject’s motives). Based on the proposed definition, 
immersion is a mental state which is why sensory stimulation is not required to 
experience immersion (e.g., daydreaming can be an immersive experience). 

It is imperative to consider all sensory modalities for determining immersion 
since the presented stimuli may stimulate only a few senses but we continue to 
receive input from all the senses that can influence immersion. Therefore, all factors 
that can either facilitate or disrupt immersion must be considered. It is unreasonable 
to merely examine the stimulus or the system to determine immersion. While the 
system and the content can affect immersion, they are not immersive independent 
of the human subject. The idea of immersive potential can add clarity to the above 
explanation.


Immersive potential: The potential of a system or content to elicit immersion.


For a given piece of content presented by a system which does not change, the
immersive potential remains constant. It does not simply increase with improvements
in the system’s technical specifications. Instead, it depends on the system’s ability to
elicit immersion. The immersive potential is bounded by the human perceptual limits:
a change to a system must lead to a discernible perceptual change to alter its
immersive potential.
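
The requirement of a discernible perceptual change can be sketched in Python. The
chapter does not prescribe a psychophysical model; the sketch below assumes Weber's
law (a change is noticeable when its size relative to the baseline reaches a constant
fraction), and the function name and all numeric values are hypothetical.

```python
def is_discernible(old_value, new_value, weber_fraction):
    """Return True if the change from old_value to new_value reaches the
    just noticeable difference under the assumed Weber's-law model.

    old_value, new_value: a physical parameter of the system (same unit).
    weber_fraction:       assumed constant ratio delta/baseline at threshold.
    """
    if old_value <= 0 or weber_fraction <= 0:
        raise ValueError("old_value and weber_fraction must be positive")
    relative_change = abs(new_value - old_value) / old_value
    return relative_change >= weber_fraction


if __name__ == "__main__":
    # Hypothetical parameter with an assumed Weber fraction of 0.1.
    print(is_discernible(100.0, 105.0, 0.1))  # small upgrade, below threshold
    print(is_discernible(100.0, 120.0, 0.1))  # larger upgrade, above threshold
```

Under this assumption, only the second upgrade could alter the immersive potential;
whether it actually does still depends on the content and the individual.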

In addition to the system and the content, immersion also depends on the state of 
the individual at the moment in time as well as their immersive tendency. 


Immersive tendency [70]: An individual’s predisposition to experience immer- 
sion. 


The immersive tendency can be determined with the help of questionnaires [69, 
70] to learn if certain individuals can get immersed relatively easily compared to 
others. It can be assumed to stay constant over the course of an experiment which is 
conducted within a short duration of time.6

The five factors that can influence immersion are (1) the system (physical prop- 
erties of the reproduction system and the content), (2) narrative (content), (3) envi- 
ronment (physical environment around the individual and the contextual conditions), 
(4) individual factors (affective states, mood, preference, skills, previous knowledge, 
expertise, goals, motivation, etc.), and (5) interaction between the individual and the 
experience (significance of the content to the individual, acceptance of the task, 
alignment of goal and motivation). These are similar to those which affect the qual- 
ity of experience (QoE) [54] since immersion is an experience that is dependent on 
an individual’s cognitive state and preference for the content. Nonetheless, there is 
a noteworthy distinction between the concept of QoE and immersive experiences. 
This is explained in the following subsection. 


11.3.1 Quality of Experience (QoE) and Immersion 


The concept of quality of experience (QoE) was introduced in the field of
telecommunication and multimedia services. It is the successor to quality of service
experienced (QoSE) which is the successor to quality of service (QoS).⁷ The progression
from QoS to QoE has shifted the approach to quality from technology-centric to 
user-centric. It is important to note that this shift is consistent with the widespread 
acknowledgment that only the end users are capable of judging quality [49]. Although 


6 Immersive tendency can change over time due to training, learning, experience, changes in per- 
sonality, etc. Since these factors do not normally vary within a short duration of time (e.g., over the 
course of a few days), these can be assumed to be constant for conducting experiments. Nevertheless, 
it is recommended to limit tests to a single session. 


7 A detailed discussion of the terms and their relationship to QoE is beyond the scope of this chapter. 
Please refer to [28, 49, 67] for an extensive review. 
several definitions of QoE are in use, the following definition by Raake and Egger [49] 
(based on the definition proposed in the Qualinet white paper [47]) provides a com- 
plete and functional description of the concept: 


Quality of experience (QoE) is the degree of delight or annoyance of a person 
whose experience involves an application, service, or system. It results from 
the person’s evaluation of the fulfillment of his or her expectations and needs 
with respect to the utility and/or enjoyment in the light of the person’s context, 
personality, and current state. 


The act of experiencing does not constitute quality judgement [49]. Evaluating 
quality requires cognitive processes in addition to those engaged during the act 
of experiencing [49]. Please refer to [30, 48] for additional information on quality 
formation process. QoE is a two-step process comprising experiencing and forming
a quality judgement. This is a major point of distinction between immersion and QoE. 
Immersion is the state of being mentally absorbed in an experience whereas QoE is 
the evaluation of quality for any experience, immersive or not. 

An immersive experience is an experience where immersion is elicited. The qual- 
ity of such an experience may be determined by methodologies inspired by QoE eval- 
uations. Thus, we place immersion on a level below QoE in the hierarchy. Immersion 
may be a factor that can influence QoE but the scientific evidence is yet to emerge. 


11.4 Differentiating Immersion from Interchangeably Used 
Terms 


The preceding section presented a detailed explanation of immersion that is synthe- 
sized from a non-exhaustive literature review. To establish the terminology and add 
clarity to the concept of immersion, a brief review of interchangeably used terms 
is presented in the following subsections, and the ideas are differentiated from
immersion. 


11.4.1 Presence 


Presence has been an important research topic for technology mediated experiences. 
Initially, presence referred to the experience of perceiving the physical environment 
and did not entail the use of technology [64]. However, presence is used in a much 
broader sense today. It is generally understood as “a psychological state or subjective 
perception in which even though part or all of an individual’s current experience is 
generated by and/or filtered through human-made technology, part or all of the
individual’s perception fails to accurately acknowledge the role of the technology in the
experience” [17]. This definition refers to what is known as physical presence (also 
called place illusion). Presence is also classified as social presence (the experience 
of being together with others) and co-presence (being together in the same physical 
space). The discussion here is limited to physical presence since it is the one that is 
often confused with immersion. 

Place illusion (physical presence) and plausibility illusion⁸ are required for
realistic behaviors in virtual environments [63]. Place illusion is a technology mediated
illusion where the user has the feeling of being in a real space which is not the actual 
physical space they are in. Slater [62] views place illusion as a subjective response to 
system immersion. He explains that “if immersion [system immersion] is analogous 
to wavelength distribution in the description of color then ‘presence’ is analogous
to the perception of color.” In this sense, presence is a perceptual attribute that is 
directly influenced by the properties of the system. To extend this in the context of the 
filter model, liking or the quality of presence would represent the overall impression 
of the experience in the cognitive domain. 

When explaining why people often report the sense of “being there” when engag- 
ing with systems possessing low system immersion, Slater hypothesized that the 
reported presence experiences were qualitatively different from those encountered 
due to objectively better systems (higher system immersion) [44]. He asserted
that presence due to superior systems is caused by exposure to the sensory
stimuli, while presence experienced with relatively inferior systems requires
focused attention and deliberate learning [63]. Slater [63] goes on to state that “[the
feeling of presence due to low system immersion] it is not simply a function of 
how the perceptual system normally works, but is something that essentially needs 
to be learned, and may be regarded as more complex.” This explanation is at odds 
with the psychophysics-based description he has provided using the analogy to color 
perception. Although it has been argued that cognition plays a role in determining 
presence [59], the sensory information delivered by the system is paramount [62]. 
Please refer to [44] for an overview of presence theories. 

At this stage, it is important to distinguish between place illusion and our definition
of immersion which was presented in Sect. 11.3. Foremost, immersion is mental 
absorption in the activity whereas presence is the feeling of being in an unmediated 
environment even when the contrary is true. It follows that immersion resides in 
the cognitive domain whereas descriptions of presence suggest that it is a perceptual 
attribute. Secondly, presence requires technologically mediated experiences whereas 
immersion can be experienced even without sensory stimulation from the system. 

We follow Jennett et al.’s [31] notion that the two concepts are independent and a 
double dissociation exists between immersion and presence. For participatory activi- 
ties, immersion can be experienced when playing abstract games such as Pac-Man on 
a mobile phone but it is unlikely that the user will feel that they are present in the game 
environment. Similarly, a high fidelity audiovisual reproduction of an uninteresting 


8 The illusion that the events in the virtual environment are actually happening even when you know
that they are not [63]. 
movie in virtual reality can deliver the illusion of being in an alternate environment 
but will fail to deliver an immersive experience. Nonetheless, it is important to note 
that presence and immersion can coincide as is often the case for engaging virtual 
reality experiences, for example. 


11.4.2 Flow 


The concept of flow was developed in the 1960s through a series of studies con- 
ducted to understand why people pursue arduous and often dangerous activities in the 
absence of discernible extrinsic rewards [12]. Multiple definitions and descriptions 
of flow have been presented including, “the holistic sensation that people feel when 
they act with total involvement” [10]; “a subjective state that people report when 
they are completely involved in something to the point of forgetting time, fatigue, 
and everything else but the activity itself” [12]; and “the state in which people are 
so involved in an activity that nothing else seems to matter; the experience itself is 
so enjoyable that people will do it even at great cost, for the sheer sake of doing it” 
[11]. Csikszentmihalyi [12] identified eight components of flow: clear goals, direct 
and immediate feedback, altered sense of time, loss of self-consciousness, concen- 
tration, balance between ability and challenge, sense of control, and escape from 
everyday life. However, researchers have not yet established the conditions that must 
be fulfilled for an experience to qualify as flow [66]. 

There is an evident overlap between flow and immersion, but the two are not 
synonymous. Immersion is a graded experience [2, 8] whereas flow is an “all-or- 
nothing” experience [9]. Flow is an optimal experience that is always enjoyable 
whereas enjoyment is not mandatory for immersion, i.e., an individual can experience 
negative emotions when immersed but it will not qualify as a state of flow since it is not 
pleasant. Additionally, the concept of flow is limited to interactive activities because 
flow components such as clear goals and immediate feedback are not applicable to 
passive activities. It has been argued that immersion is a precursor to flow, but flow is 
not simply the highest degree of immersion [31]. For instance, a passive, unpleasant 
experience can be highly immersive but will fail to qualify as flow due to a lack of 
enjoyment and the interactive components that constitute flow. 


11.4.3 Envelopment 


Envelopment is an important topic in concert hall acoustics and spatial audio. It 
is classified as listener envelopment (sense of being surrounded by the reverberant 
sound field) [56] and source-related envelopment (envelopment by sounds placed 
around the listener) [19]. It is clear from the literature [33] that envelopment is strictly
a perceptual attribute. However, it continues to be confused with immersion. There are
two reasons that can explain the interchangeable usage: (1) use of the common analogy
“feeling of swimming underwater” to illustrate immersion as well as envelopment,
and (2) approaching immersion as perceptual immersion (see Sect. 11.2.1.1) makes 
the two synonymous. 

The predominant difference between envelopment and immersion is that the for- 
mer is perceptual while the latter is affective since it is an integrative measure that 
accounts for cognitive factors. A double dissociation exists between immersion and 
envelopment. For example, monophonic reproduction in a non-reverberant environ- 
ment will not elicit the feeling of envelopment but it can be immersive. Similarly, an 
accurate reproduction of a soundscape is unlikely to be immersive due to a lack of 
an engaging narrative but will be reported to be highly enveloping. Nevertheless, it 
should be noted that envelopment and immersion can coexist. Further, envelopment 
can lead to immersion in an experience since the sense of being surrounded is one of the
reasons that can lead to psychological immersion (see Fig. 11.1). 


11.5 Subjective Assessment of Immersion: An Exploratory 
Study 


Quantification of immersion is the immediate step following the theoretical concep- 
tualization of the topic. Nevertheless, a lack of established experimental paradigms 
for the assessment of immersion is the greatest challenge in developing our under- 
standing of the topic. A lack of a consensus on the idea of immersion, fragile nature 
of immersive experiences [43], and limited information about the factors and their 
influence on immersion add to the complexity of quantifying immersion.
Methodologies for assessing immersion can be classified as subjective and objective measures.⁹
An outline of these is presented below. 


11.5.1 Subjective and Objective Measures 


Subjective measurement paradigms ask the participants to reflect on their expe- 
rience and form a conscious judgement. Questionnaires, focus groups, think aloud 
paradigms, and interviews are examples of subjective measures. These are conducted 
post-experience in order to avoid infringing on the experience. Thus, they are less 
susceptible to the emotional and physiological idiosyncrasies. Subjective measures 
are attractive as they are non-invasive and easy to interpret for the participants. They 
allow researchers to explore multiple facets of immersion (e.g., emotions, mental 
and physical awareness, liking, etc.) as the areas of interest can be multiple items 
on a questionnaire or be verbally questioned in an interview. Moreover, subjective 


9 Please refer to [3] for literature on subjective and objective measures. Additionally, Zhang [71]
has provided a detailed discussion on the pros and cons of the measures. 
measures are excellent for determining individual differences as the responses can 
be directly compared and analyzed. 

The simplicity and effectiveness of subjective measures are appealing but the
drawbacks must be considered to select suitable experimental paradigms. Foremost, the
post-experience nature of these measures can lead to inaccurate recall and recency 
effect. These can be particularly problematic when longer stimuli are used for eval- 
uations. The retrospective recall also restricts the evaluation of temporal variations 
in immersion. Finally, for subjective measures that are based on a set of predefined 
questions (e.g., questionnaires), there is a risk of failing to capture all the aspects of 
the immersive experience that are beyond the scope of the listed items. 

In contrast to subjective measures, objective measures attempt to record the user’s 
response without requiring conscious evaluation and correlate those responses to 
immersion and/or its attributes. Behavioral and physiological measures are the two 
types of objective methods used for assessing immersion. The former includes
measures such as secondary task reaction time (STRT)¹⁰ while the latter involves the
use of biological sensors to measure physical response to the stimulus (e.g.,
electroencephalography, eye tracking, and electrodermal activity). These methods do
not allow for the direct measurement of immersion. Instead, the recorded response
is correlated to immersion or to its suspected attributes.

The objective and non-intrusive aspect of objective measures yields an accurate,
time-variant measurement of the concept under evaluation. Since the deliberate
judgement formation process is eliminated unlike subjective measures, the measure- 
ments are not influenced by the various biases associated with subjective evaluations. 
The single most important criticism of objective measures is the lack of established 
relationship(s) between immersion and what is measured. Hence, there is a risk of 
measuring an idea that may not be related to immersion, or that is related differently than
assumed. In addition to the lack of one-to-one mapping, physiological signals can 
be highly sensitive, require specialized equipment in controlled environments, and 
may need extensive data analysis procedures. 


11.5.2 Research Questions 


An experiment was conducted to develop and test a suitable methodology for
assessing immersion as a necessary first step.¹¹ Answers to the following research questions
were sought in the study: 


10 The premise for STRT is that our cognitive resources are limited. Thus, if resources are largely 
expended on the primary task, fewer resources will be available for the secondary task, which will
be reflected in the efficiency with which the secondary task is performed. The level of immersion can
hence be measured by the performance of the secondary task, i.e., when STRT performance is low,
a high level of immersion was experienced.


11 The experiment is covered in detail in [2]. Here, only the key points are discussed. 
RQ1 How can immersion in an audiovisual experience be quantified through subjective testing?

RQ2 Is immersion a binary (all-or-nothing) or graded experience? 

RQ3 What is the influence of immersive tendency on immersion ratings? 


11.5.3 Experimental Strategy and Design 


Subjective and objective tests each have their advantages and disadvantages as
discussed in Sect. 11.5.1. The fundamental issue with physiological measures is the lack
of established links between what is measured and immersion. Thus, one cannot be 
certain if what is being measured is immersion or is related to immersion in a quantifi- 
able way. Experimental designs that incorporate behavioral responses such as STRT 
are potential alternatives but have failed to yield conclusive results. These limitations
of objective measures restrict us to subjective assessment of immersion [40].

Subjective assessment of immersion has been predominantly conducted using 
questionnaires. However, since the questionnaires often have in excess of 25 items, 
administering them for each experience adds to the experimental time and the work- 
load for the participants. Further, questionnaires fail to capture the unexpected aspects 
of immersion or those that are unaccounted for in the set of questions [71]. Jennett 
et al. [31] compared the results from a questionnaire to that of a single question on 
the immersion experienced by the participants. Their experiments revealed that “peo- 
ple can reliably reflect on their own immersion in a single question” when grading 
immersion on a categorical 10-point scale. This is an important finding as it implies 
that immersion experiments may be conducted as rating experiments. Since rating 
experiments are the norm for audiovisual assessment and participants are familiar 
with the general paradigm, it was decided to conduct the experiment as a rating 
experiment. 

Before the experimental design could be developed, it was necessary to outline 
the theoretical implications on the experimental paradigm. First, the participants 
cannot be permitted to switch between stimuli for making comparative judgments 
as it will destroy the state of immersion [3]. Similarly, the evaluations must be 
made post-experience. Second, it is hypothesized that individuals require time to 
return to their base or initial state after an immersive experience. Distractor tasks 
can be incorporated to shift attention away from the experience between consecutive 
presentations. Third, the experiment must be completed in a single session since 
participants experience fatigue faster in non-participatory tasks [29] and time can 
alter individual factors such as mood. Finally, each participant should be limited 
to one instance of any stimulus due to limited information regarding the effect of 
repetition on immersion. 

With the implications in mind, a pilot test was conducted as a randomized com- 
plete block design to aid with the selection of stimuli and to test the protocol. Six 
participants each graded the same set of 5 stimuli. The pilot test results suggested 
that the session should be limited to 75–80 min in order to avoid participant fatigue.
Since a large number of stimuli had to be tested (particularly for RQ2) and
repetitions were prohibited, a balanced incomplete block (BIB) design was determined to
be the most appropriate choice for the main study. A major drawback of a simple 
BIB design is that as the number of stimuli to evaluate increases, the number of 
participants required increases drastically (provided the number of evaluations each 
participant performs does not change). Thus, precision had to be traded to reduce the 
number of participants required for the experiment [35]. 

The simple BIB design was reduced to a BIB design with 21 blocks (participants) 
and 15 treatments (stimuli). Every participant evaluated a subset of 5 stimuli from 
the set of 15. The stimuli were allocated such that each pair of stimuli (e.g., A and F) 
would only appear together in two blocks (i.e., only two participants would get both 
A and F). In total, there were 7 instances of each of the 15 stimuli that yielded 105 
total observations as 21 participants graded 5 stimuli each. The allocation of the 
stimuli to the different blocks is shown in Table 11.1. 


Table 11.1 Allocation of stimuli to the experimental blocks for the balanced incomplete block 
(BIB) design used in the study. Reproduced from [2] 


Block  Exp. 1  Exp. 2  Exp. 3  Exp. 4  Exp. 5
1      D       J       M       N       O
2      A       B       H       I       M
3      A       C       E       F       O
4      C       E       G       H       L
5      B       D       E       J       K
6      B       C       D       G       O
7      F       G       K       M       O
8      C       F       I       J       M
9      A       D       G       L       M
10     A       C       J       K       L
11     C       D       H       I       K
12     B       F       G       H       J
13     H       J       L       N       O
14     D       E       F       I       L
15     B       F       K       L       N
16     A       G       I       K       N
17     A       B       I       L       O
18     B       C       E       M       N
19     E       G       I       J       N
20     A       D       F       H       N
21     E       H       K       M       O


The prefix “s” used for representing the stimuli is dropped in this table for clarity.
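
The counting behind the reduced design can be checked with a short script. The sketch below is illustrative only: it transcribes the allocation of Table 11.1 (with the prefix “s” dropped), tallies how often each stimulus and each pair of stimuli occur, and evaluates the standard BIB counting identity lambda * (v - 1) = r * (k - 1). The names BLOCKS, replication_counts, pair_counts, expected_lambda, and summarize are hypothetical and are not part of the original study.

```python
from collections import Counter
from itertools import combinations

# Allocation transcribed from Table 11.1; one string of five stimuli per block.
BLOCKS = [
    "DJMNO", "ABHIM", "ACEFO", "CEGHL", "BDEJK", "BCDGO", "FGKMO",
    "CFIJM", "ADGLM", "ACJKL", "CDHIK", "BFGHJ", "HJLNO", "DEFIL",
    "BFKLN", "AGIKN", "ABILO", "BCEMN", "EGIJN", "ADFHN", "EHKMO",
]


def replication_counts(blocks):
    """Number of blocks (participants) in which each stimulus appears."""
    return Counter(stimulus for block in blocks for stimulus in block)


def pair_counts(blocks):
    """Number of blocks in which each unordered pair of stimuli appears together."""
    return Counter(
        pair for block in blocks for pair in combinations(sorted(block), 2)
    )


def expected_lambda(v, k, r):
    """Pair concurrence implied by the identity lambda * (v - 1) = r * (k - 1)."""
    return r * (k - 1) / (v - 1)


def summarize(blocks):
    """Collect the design parameters and the observed pair concurrences."""
    reps = replication_counts(blocks)
    pairs = pair_counts(blocks)
    # Enumerate every possible pair so that pairs never sharing a block count as 0.
    all_pairs = list(combinations(sorted(reps), 2))
    observed = [pairs[p] for p in all_pairs]
    return {
        "v": len(reps),                       # treatments (stimuli)
        "b": len(blocks),                     # blocks (participants)
        "k": len(blocks[0]),                  # stimuli per participant
        "observations": sum(reps.values()),   # total gradings
        "mean_pair_count": sum(observed) / len(observed),
        "min_pair_count": min(observed),
        "max_pair_count": max(observed),
    }


if __name__ == "__main__":
    print(summarize(BLOCKS))
    print(expected_lambda(v=15, k=5, r=7))
```

In an exactly balanced design every pair count equals lambda. The summary therefore also reports the smallest and largest observed pair counts, which is where any departure from exact balance in the transcribed table would show up.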
11.5.4 Methods 


11.5.4.1 Program Material 


There are various implications for selecting the program material for assessing 
immersion (see [3]). Foremost, the relevance of the program material to the par- 
ticipant plays a role in determining immersion and can vary among participants. 
Thus, it should not be assumed that any given stimulus can immerse all participants. 
Additionally, since knowledge and expectations may change with every trial of a 
stimulus, an assessor may not experience immersion during repeated presentations 
of the same stimulus. 

It is important to select audiovisual excerpts with lengths sufficient to elicit
immersion. It has been recommended that stimuli that are at least 10 min long must be
used [29], but there is limited information regarding the temporal nature of immer- 
sion. The recommendation is focused on participatory activities, and we suspect that 
the length of the stimulus can be lower and is dependent on the narrative. Thus, 
excerpts ranging from 4 to 12 min were selected for this study. 

Given the lack of knowledge regarding the effect of familiarity on immersion, 
it is suggested that content that is unfamiliar and that does not require additional 
background information must be selected. However, this stipulation limits the amount 
of content that can be selected. Therefore, it was decided to provide the participants 
with a short synopsis (1–2 sentences) regarding the narrative before each presentation.
These were constructed only to include any relevant information required to make 
sense of the story and did not disclose any additional information. 

Finally, to select the technical specifications of the excerpts, an informal survey 
of the domestic media landscape suggested that ultra high-definition (UHD), high 
dynamic range (HDR) visuals and spatial audio are emerging for domestic consump- 
tion. These are being incorporated by broadcasters, streaming platforms, and movie 
studios alike. Thus, it was decided to use a 7.1.4 audio rendering system coupled 
with a UHD HDR-enabled screen. The 7.1.4 audio reproduction system was chosen
as it was revealed to be the most common spatial audio reproduction system beyond 
traditional surround sound for domestic applications. 

Audiovisual excerpts of different lengths and narratives that can elicit spatial, 
emotional, and temporal immersion were chosen for the experiment. An active effort 
was made to select stimuli that were distributed across the immersion scale as the 
results are directly dependent on the stimuli. The selection was made based on the 
pilot experiment and comments received from the pilot test participants because the 
technical specifications could not be used to choose the excerpts. A list of the excerpts 
and the genres is presented in Table 11.2. 

The fundamental challenge with selecting stimuli that have UHD HDR visuals
coupled with spatial audio was a lack of freely available content. Hence, commer- 
cially available content with Dolby or DTS audio had to be used for this experiment. 
Fifteen audiovisual excerpts that fulfilled the above-stated conditions were selected. 
The resolution, native aspect ratio, and chroma sub-sampling were not changed for 
reproduction. The audiovisual signals were not processed at any stage. 
Table 11.2 Audiovisual excerpts used in the experiment. Reproduced from [2]

Excerpt | Content | Genre | Year | Timecode
Example | Earth: One Amazing Day | Nature | 2018 | 00:08:50 – 00:16:49
sA | Mission: Impossible – Fallout | Action | 2018 | 01:12:31 – 01:16:09
sB | Apocalypse Now – Final Cut | War/Drama | 2019 | 02:12:45 – 02:20:24
sC | The Revenant | Adventure | 2016 | 01:53:09 – 01:58:24
sD | Fantastic Beasts: CG (a) | Fantasy/Adventure | 2019 | 01:34:50 – 01:42:47
sE | Dynasties: Lion | Nature | 2018 | 00:16:11 – 00:20:00
sF | The Darkest Hour | War/Drama | 2018 | 00:41:09 – 00:48:00
sG | Murder on the Orient Express | Mystery/Drama | 2018 | 00:00:53 – 00:08:31
sH | Braveheart | War/Drama | 2018 | 00:22:05 – 00:28:36
sI | Ad Astra | Sci-fi | 2020 | 01:15:23 – 01:21:17
sJ | Earth: One Amazing Day | Nature | 2018 | 00:57:50 – 01:02:39
sK | Spider-Man: Into the Spider-Verse | Animation/Action | 2019 | 00:02:32 – 00:13:42
sL | The Revenant | Adventure | 2016 | 00:02:30 – 00:14:59
sM | Sicario | Crime/Action | 2018 | 00:01:00 – 00:12:53
sN | Earth: One Amazing Day | Nature | 2018 | 00:47:47 – 00:51:37
sO | Earth: One Amazing Day | Nature | 2018 | 00:16:50 – 00:22:35

(a) CG: Crimes of Grindelwald


Notes: 


1. The year of release refers to the UK year of release on 4K Blu-ray. 


2. Please refer to Table 5 in [2] for the corresponding narrative synopsis. 
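
The excerpt lengths follow directly from the timecodes in Table 11.2, and the viewing time per participant follows from combining them with the allocation in Table 11.1. The sketch below is a minimal illustration of that arithmetic, assuming the timecodes as transcribed here; TIMECODES, to_seconds, excerpt_durations, and block_viewing_time are hypothetical names introduced for this example.

```python
# Start and end timecodes (HH:MM:SS) transcribed from Table 11.2.
TIMECODES = {
    "sA": ("01:12:31", "01:16:09"),
    "sB": ("02:12:45", "02:20:24"),
    "sC": ("01:53:09", "01:58:24"),
    "sD": ("01:34:50", "01:42:47"),
    "sE": ("00:16:11", "00:20:00"),
    "sF": ("00:41:09", "00:48:00"),
    "sG": ("00:00:53", "00:08:31"),
    "sH": ("00:22:05", "00:28:36"),
    "sI": ("01:15:23", "01:21:17"),
    "sJ": ("00:57:50", "01:02:39"),
    "sK": ("00:02:32", "00:13:42"),
    "sL": ("00:02:30", "00:14:59"),
    "sM": ("00:01:00", "00:12:53"),
    "sN": ("00:47:47", "00:51:37"),
    "sO": ("00:16:50", "00:22:35"),
}


def to_seconds(timecode):
    """Convert an HH:MM:SS timecode to seconds."""
    hours, minutes, seconds = (int(part) for part in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def excerpt_durations(timecodes):
    """Duration of every excerpt in seconds."""
    return {
        name: to_seconds(end) - to_seconds(start)
        for name, (start, end) in timecodes.items()
    }


def block_viewing_time(block, durations):
    """Total viewing time in seconds for one block of Table 11.1, e.g. "DJMNO"."""
    return sum(durations["s" + letter] for letter in block)


if __name__ == "__main__":
    durations = excerpt_durations(TIMECODES)
    for name, seconds in durations.items():
        print(f"{name}: {seconds // 60} min {seconds % 60} s")
    print("Block 1 viewing time (s):", block_viewing_time("DJMNO", durations))
```

Such a tally only covers the viewing time; the distractor tasks, the questionnaire, and the ratings add to the session length.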


11.5.4.2 Reproduction Setup 


The audiovisual excerpts were presented directly from the Blu-ray player to every 
participant due to legal limitations. An HDCP compliant video switcher and the 
Genelec loudspeaker manager (GLM) were used to control the video and the audio 
respectively. The complete audiovisual signal chain is depicted in Fig. 11.3. 

A 7.1.4 audio rendering system was used for audio reproduction. The audio was 
decoded by the Marantz AV7704 and the decoded channels were mapped to the 
corresponding loudspeaker channels. A phantom reproduction of the center audio 
channel was used since it was not feasible to have a physical loudspeaker due to the 
screen. The Genelec loudspeakers were distributed on a hemisphere with a 2 m radius 
around the listening position. The placement of the loudspeakers was in accordance 
with Dolby guidelines [13]. 
[Figure 11.3 diagram. Blocks shown: Sony UBP-X500 UHD Blu-ray Player; Marantz AV7704 Audiovisual Processor; Roland V-600 Video Switcher; LG C9 65-inch OLED screen; Monitoring Screen; Genelec Speakers (10 x 8340a, 1 x 7340a); GLM Control]


Fig. 11.3 Audiovisual reproduction signal flow. The different line types refer to: HDMI 2.0 and 
HDCP 2.2 connection (——), analog audio feed over XLR (- - -), remote loudspeaker control over 
Ethernet (------ ), and HDMI 1.4 connection (--------- ) 


The loudspeakers were level calibrated and time aligned with respect to each 
other. To achieve approximately equal loudness among the stimuli, the level was 
varied such that the audio was equally loud at the listening position as determined 
by ear. All excerpts were auditioned by two experienced listeners to ensure that the 
stimuli were at comfortable loudness and that the audio was intelligible during the 
quieter segments. 

A 65-inch LG C9 OLED screen was used to reproduce the visuals. The screen 
was centered with respect to the participants to obtain a zero degree viewing angle 
horizontally and vertically. It was placed at a distance of 2 m (same as the loudspeak- 
ers) following the design viewing distance in ITU-R BT.2022 recommendation [26]. 
To balance the judder of the 24p video signal while exploiting the high dynamic range
(HDR) capabilities of the screen, the screen brightness was lowered to nearly 120 
nits and the environmental illuminance was less than 10 lux. The screen settings were 
tuned by two experienced viewers in part to get the chromaticity coordinates closer 
to the D65 value [27] and in part based on experience. 

The audiovisual reproduction took place in an IEC 60268-13 [25] compliant
listening room. All equipment except the screen and the loudspeakers was placed outside
the room. The loudspeakers were hidden behind acoustically transparent curtains to
limit visual influence. 


11.5.4.3 Distractor Tasks 


It was hypothesized that some time is required to return to the initial or base psycho- 
logical state after an engaging experience. Presentation of stimuli in quick succession 
may lead to the preceding stimulus biasing the result of the following stimulus. In 
the absence of formal guidelines and conclusive evidence regarding the gap in time 
between presentation of stimuli, distractor tasks were incorporated in an attempt to 
shift attention away from the preceding stimulus by requiring active participation. 
An 11-piece LEGO® unicorn puzzle (the only instruction was to create a unicorn), an
image for free interpretation,¹² a matchstick rearrangement puzzle, and a memory


12 The participants were guided by asking them to note their impression of what was happening in 
the picture, what led to that impression, and what additional information they could collect.


Fig. 11.4 The four distractor tasks: (a) matchstick rearrangement puzzle, (b) memory task (7 x 6
tiles), (c) LEGO® puzzle, (d) image for free interpretation (obtained from the New York Times). All
images reproduced from [2]


puzzle were the four distractor tasks. One task was chosen at random to be completed 
within four minutes between each successive presentation. The assessors were made 
aware of the correct solution for the matchstick and memory puzzle tasks before 
proceeding (Fig. 11.4). 
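
The resulting session structure, with a distractor task between successive presentations, can be sketched as follows. This is a hedged illustration, not the procedure used in the study: it assumes that the stimuli are presented in the column order of Table 11.1 and that the four tasks are shuffled so that each fills one of the four gaps, whereas the text only states that a task was chosen at random. DISTRACTORS and session_schedule are hypothetical names.

```python
import random

# The four distractor tasks described in the text.
DISTRACTORS = [
    "matchstick puzzle",
    "memory task",
    "LEGO puzzle",
    "image interpretation",
]


def session_schedule(block, rng):
    """Interleave the stimuli of one block with distractor tasks.

    Assumption: tasks are shuffled once per session, so with five stimuli
    each of the four tasks is used exactly once.
    """
    tasks = DISTRACTORS[:]
    rng.shuffle(tasks)
    schedule = []
    for index, letter in enumerate(block):
        schedule.append(("stimulus", "s" + letter))
        # A distractor follows every stimulus except the last one.
        if index < len(block) - 1:
            schedule.append(("distractor", tasks[index % len(tasks)]))
    return schedule


if __name__ == "__main__":
    # Block 1 of Table 11.1; the seed is arbitrary and only makes the run repeatable.
    for kind, name in session_schedule("DJMNO", random.Random(1)):
        print(kind, name)
```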


11.5.4.4 Immersive Tendency Questionnaire 


Questionnaires are the primary tool for gauging immersive tendency. A reduced
version of the widely used [23, 32, 34, 42] immersive tendency questionnaire (ITQ)
by Witmer and Singer [70] was used for this study (see Table 11.3 for questionnaire items),
with a few modifications. The seven-point categorical scale was substituted by the graphic
line scale (also used for rating immersion) to obtain continuous data; the middle verbal
anchor of the categorical scale was dropped as it has been shown that scores can cluster
around the verbal anchor [72]; and the terminal verbal anchors were modified to be perfect
antonyms (a similar modification was made in [23]). The order of questions was randomized
for the participants. All assessors answered the ITQ.
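
Table 11.3 reports corrected item-total correlations for the questionnaire items. As a reminder of what that statistic measures, the sketch below computes, for each item, the Pearson correlation between the item scores and the sum of all remaining items, so that an item is not correlated with itself. The response values are invented for illustration and are not data from the study; pearson and corrected_item_total are hypothetical helper names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (nan if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def corrected_item_total(responses):
    """Corrected item-total correlation for every item.

    responses: one list per respondent, holding one score per item.
    """
    n_items = len(responses[0])
    correlations = []
    for item in range(n_items):
        item_scores = [row[item] for row in responses]
        # Total of the remaining items, excluding the item under test.
        rest_scores = [sum(row) - row[item] for row in responses]
        correlations.append(pearson(item_scores, rest_scores))
    return correlations


if __name__ == "__main__":
    # Hypothetical line-scale scores for three respondents and three items.
    example = [[1.0, 2.0, 5.0], [2.0, 3.0, 3.0], [3.0, 5.0, 1.0]]
    print(corrected_item_total(example))
```

Low or negative values flag items that do not vary together with the rest of the scale.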


11.5.4.5 Assessors 


The participants were considered a blocking factor (see Sect. 11.5.3) in the exper- 
imental design. Twenty-one assessors (blend of experienced and inexperienced) 
were each assigned to a block at random. Audiovisual assessment expertise was 
not required since immersion is a cognitive concept. For this study, experienced 
Table 11.3 Witmer and Singer’s [70] Immersive Tendency Questionnaire (ITQ). The items in the 
reduced version of the questionnaire and the corrected item-total correlations from the present study 
are shown below. Please refer to [2] for analysis of the questionnaire data. Taken from [2] 


Question Corrected item-total 
correlations* 

Do you easily become deeply involved in movies or TV 0.63 

dramas? 

Do you ever become so involved in a daydream that you are not | 0.44 

aware of things happening around you? 

Do you ever have dreams that are so real that you feel 0.34 

disoriented when you awake? 

When watching sports, do you ever become so involved in the | -0.12 

game that you react as if you were one of the players? 

How good are you at blocking out external distractions when 0.00 

you are involved in something? 

Have you ever remained apprehensive or fearful long after 0.32 

watching a scary movie? 

Have you ever gotten scared by something happening ona TV | 0.31 

show or in a movie? 

Do you ever become so involved in a video game that it is as if | 0.65 

you are inside the game rather than moving a joystick and 

watching the screen? 

How often do you play arcade or video games? (OFTEN should | 0.41 

be taken to mean every day or every two days, on average) 

Have you ever gotten excited during a chase or fight scene on 0.65 

TV or in the movies? 

How well do you concentrate on enjoyable activities? 0.05 

Do you ever become so involved in a television program or 0.26 

book that people have problems getting your attention? 

How mentally alert do you feel at the present time? 0.33 

How physically fit do you feel today? 0.17 

How frequently do you find yourself closely identifying with 0.58 

the characters in a story line? 

When playing sports, do you become so involved in the game 0.25 

that you lose track of time? 

Do you ever become so involved in a movie that you are not 0.55 

aware of things happening around you? 

Do you ever become so involved in doing something that you 0.68 

lose all track of time? 

What kind of books do you read most frequently? Select one 

Spy novels Fantasies Science Fiction 

Adventure Romance novels Historical novels 

Westerns Mysteries Other fiction 

Biographies Autobiographies Other non-fiction - 


à The corrected item-total correlations are Pearson product-moment correlations between the item 
and the sum of all items except the item it is being tested against. These numbers are from the 
current study. 
assessor refers to participants who had experience participating in audiovisual tests,
were under continuous weekly training and evaluation exercises, and participated in
product development or research activities at Bang & Olufsen. Inexperienced
assessors may have participated in audiovisual tests before but were not familiar with
subjective evaluation, did not have formal training, and were not actively focused on
the technical aspects of audiovisual products or experiences. In total, fifteen males
and six females participated in the experiment. The mean age of the participants
was 37.7 years (SD = 14.28). Auditory and visual acuity were self-reported by the
participants.


11.5.5 Procedure 


The experiment included two phases: a rating phase and the administration of the
immersive tendency questionnaire. Both were completed in a single session of
approximately 90 min.

The participants were introduced to the experimental procedure and asked to con- 
firm visual and auditory acuity before participating. The instructions were delivered 
verbally and in writing. For the rating phase, the participants were given the following 
description of immersion as stated in [2]: 

“Immersion, also known as deep mental involvement, can be described as being 
mentally lost (absorbed) in the experience. Immersion is encountered when the expe- 
rience is involving and absorbs you mentally by capturing your attention. For exam- 
ple, immersion may be experienced when reading a book, playing video games, 
watching a movie, etc.” 

The participants were asked to rate overall immersion on a graphic line scale. The
motivation for the scale comes from sensory analysis. It is a 15 cm long line on which
the participants are instructed to draw an intersecting mark to denote their perception.
The distance of the mark from the left end of the scale is taken as the score (e.g.,
6.8 cm equals a rating of 6.8). The scale was chosen as it offers the participants
(in theory) infinitely many steps to indicate the intensity of the idea under evaluation.
The lack of numbers and verbal anchors (other than those near the endpoints) reduces
the bias associated with them. The scale used for the test is shown in Fig. 11.5.

In addition to rating, familiarity with the content was documented by asking 
assessors to state if they had experienced the excerpts previously. An excerpt that 


[Figure: horizontal line scale labeled “Immersion”, anchored by “Not very immersive” at the
left end and “Very immersive” at the right end]


Fig. 11.5 Graphic line scale for evaluating immersion. Same scale with different verbal anchors
was used for the immersive tendency questionnaire (from [2])
could elicit immersion was shown before the test in an attempt to exemplify
immersion. However, it was explicitly mentioned that it was only an attempt to illustrate
immersion, should not be used as a reference, and may not lead to immersion for the
participants. The participants were notified that there were no correct responses and
that the use of the entire scale was not mandatory.

A synopsis was provided before each experience. A distractor task was chosen at
random to be performed between successive presentations of excerpts. The immersive
tendency questionnaire was administered at the end of the rating phase. The
experiment was conducted as a pen-and-paper test and the data were collected within
three weeks.


11.5.6 Results 


Ratings from both phases of the experiment were converted to scores between 0 and 
15 (up to one decimal). The converted scores were used for analyzing the data. 


11.5.6.1 Effect of Stimuli and Differences Between Stimuli Pairs 


Data from the rating part of the experiment were analyzed using analysis of variance
(ANOVA). Since the scale usage effects were confounded in the collected data and
it was not feasible to account for and remove these effects, the estimated marginal
means were used to estimate the effect [35]. A mixed-effects ANOVA model with
stimuli as a fixed factor and participants (blocks) as a random factor was used for the
analysis. The trials were independent of each other and the assumption of homogeneity
of variances was upheld. The residuals did not deviate significantly from the normal
distribution according to the Shapiro-Wilk test, W = 0.99, p = 0.710.

The ANOVA showed that the effect of the stimuli on immersion scores was
significant, F(14, 74.82) = 3.32, p < 0.001. This indicates that there were distinct
differences between the stimuli and that the participants were able to distinguish between
them. The blocking factor (participants) was not statistically significant (p > 0.05).
However, the purpose of the blocking factor is to control for the differences between
the participants, so its effectiveness cannot be judged by statistical significance
alone. Due to a lack of repetitions, the interaction between the stimuli and participants
factors could not be investigated.

Pairwise comparisons were made between all pairs of stimuli on the basis of the
estimated least-squares means. Of the 105 pairs of stimuli, five were found to
differ significantly (Tukey’s adjustment). These pairs are marked in Fig. 11.6
above the box plots. The results of the pairwise comparisons suggest that the
stimuli fall into one of three groups: those for which participants experienced high
immersion (sB, sE, sL, and sM), low immersion (sA and sG), and moderate immersion
(all remaining stimuli).
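
The figure of 105 comparisons follows from the 15 stimuli used in the experiment, since every unordered pair is compared once. The short sketch below checks that arithmetic; the labels sA to sO are an assumption extrapolated from the stimulus names mentioned in the text.

```python
from itertools import combinations
from math import comb
from string import ascii_uppercase

N_STIMULI = 15  # number of stimuli in the rating experiment

# Assumed labels sA ... sO; only some of these names appear in the text.
stimuli = ["s" + letter for letter in ascii_uppercase[:N_STIMULI]]

# Every unordered pair of distinct stimuli is one pairwise comparison.
pairs = list(combinations(stimuli, 2))

print(len(pairs))          # 105
print(comb(N_STIMULI, 2))  # 105, i.e. 15 * 14 / 2
```

With this many simultaneous comparisons, an adjustment for multiplicity such as the Tukey adjustment named in the text is needed to keep the family-wise error rate under control.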
Fig. 11.6 Visualization of the raw data (not adjusted for scale usage) from the rating phase of the
experiment. The significant stimuli pairs as determined by the pairwise comparisons are shown 
above the box plots 


11.5.6.2 Nature of Immersion: Binary or Graded? 


The distribution of raw immersion scores can reveal whether immersion is a binary 
or a graded concept. When a large number of stimuli are evaluated, the scores should 
cluster toward the ends of the scale if immersion is a binary concept, i.e., the distri- 
bution of scores should be bimodal. Hartigan’s dip test was used to determine if the 
distribution of immersion scores was unimodal or multimodal. 

Hartigan’s dip test is based on Hartigan’s dip statistic (HDS). This statistic
is the maximum difference between the empirical distribution function (EDF) and
the unimodal distribution function that minimizes that maximum difference. The
uniform distribution is chosen as the reference distribution as it is the least
favorable unimodal distribution [21]. A large difference between the distributions
leads to a higher HDS value and signals a departure from unimodality. To compute
the p-value, bootstrapped samples are generated and their dip statistic is compared
iteratively to the dip statistic obtained from the empirical distribution. Please refer
to [21, 22] for an in-depth explanation of the mathematical calculations. The
distribution of the dip statistic values for the bootstrapped samples and the empirical
distribution function is shown in Fig. 11.7.

The average p-value was 0.862 (σ = 0.04) for 100 calculations at the 5% significance
level. The null hypothesis that the distribution of the data is unimodal could not be
rejected. This result suggests that immersion is a graded concept.
Fig. 11.7 The distribution of the Hartigan dip statistic values for the bootstrapped samples and the
empirical distribution function


11.5.6.3 Influence of Immersive Tendency on Immersion Ratings


RQ3 was designed to study whether the susceptibility to become immersed in an
experience has a direct influence on the immersion ratings. To this end, it was
hypothesized that immersion ratings for any stimulus should increase with an increase
in the ITQ total scores. Kendall’s rank-order correlation was chosen to investigate
whether a monotonic relationship existed between the immersion scores and the ITQ
total scores. The value of Kendall’s τ ranges between −1 and +1, where −1 signifies
complete disagreement between the two rankings and +1 a perfectly increasing
monotonic relationship. A value of 0 means that there is no monotonic relation
between the two variables, but other relationships may exist.
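
To make the interpretation of τ concrete, the following sketch computes the tie-corrected variant (τ-b) by counting concordant and discordant pairs. Which variant was used in the study is not stated, so the choice of τ-b, the function name, and the toy scores are assumptions for illustration only.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences of scores."""
    n = len(x)
    concordant = discordant = 0
    tied_x_only = tied_y_only = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: counted nowhere
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # the pair is ordered the same way on x and y
            else:
                discordant += 1  # the pair is ordered in opposite ways
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical ITQ totals and immersion ratings for five assessors.
itq_totals = [110.0, 125.5, 140.0, 152.0, 170.5]
ratings = [4.2, 6.8, 5.9, 9.1, 11.3]
print(round(kendall_tau_b(itq_totals, ratings), 2))  # 0.8
```

Because τ depends only on the ordering of the scores, it does not require the two line-scale measures to be linearly related, only monotonically related.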

The data and Kendall’s rank-order correlation coefficients are shown in Fig. 11.8.
The values of Kendall’s τ were largely non-significant: only two correlations
(for stimuli sD and sJ) were found to be statistically significant. This result suggests
that there is no direct influence of immersive tendency on immersion ratings. This
inference is based on the critical assumption that the scale usage for the rating phase
and the questionnaire items is identical and that immersive tendency is captured and
reflected appropriately by the ITQ total score.


11.5.7 Discussion 


There is a growing interest in studying immersion to enhance the audiovisual
experiences enabled by technologies such as spatial audio and virtual reality. The
primary challenge in investigating immersion is the lack of suitable assessment
methodologies. In this study, we explored a rating experiment inspired
Fig. 11.8 Kendall’s rank-order correlation between immersion scores and the immersive tendency
questionnaire’s total score. Kendall’s τ was significant for stimuli D and J. Regression lines (in red)
are plotted only to aid the reader. Participants’ familiarity with the content is denoted by the shape
of the data points. Adapted from [2]

experimental paradigm for the subjective quantification of immersion in audiovisual
experiences.

In subjective testing, the instructions provided to the participants are critical since 
the assessors make deliberate judgments based on the provided descriptions. It is 
challenging to communicate the intended idea for cognitive concepts such as immer- 
sion due to a lack of standardized definitions and the inability to demonstrate the 
perceptual differences between stimuli. The results from the experiment show that 
participants were able to comprehend the provided description of immersion and
distinguish between the different stimuli accordingly. The pairwise comparisons show
that there were clear differences between the statistically significant pairs of
stimuli even though statistical power was limited. Additionally, the assessors did not
report issues with understanding the description before or during the experiment. These
results indicate that participants can reflect on the immersion they experience and
convey it using a unidimensional scale, as suggested by Jennett et al. [31].

It is important to understand the nature of immersion (i.e., binary or graded) 
to develop the conceptual understanding of the topic. Qualitative studies [8] and 
theoretical interpretations have conceptualized immersion as a graded concept but 
empirical tests have not been conducted. Results from Hartigan’s dip test show that
the distribution of immersion scores is not multimodal, suggesting that immersion is
a graded concept. This is consistent with the conceptual understanding of the topic.
Immersion being a graded concept implies that direct comparisons can be made 
between experiences and systems on an interval scale. 

A direct influence of immersive tendency on immersion scores could not be
detected in this study. Only two of the 15 correlations were found to be statistically
significant. However, it is interesting to note that one of those correlations was
negative, implying that individuals with a higher degree of immersive tendency found
the stimulus to be less immersive. We are unable to explain this finding but believe that
analyzing the contents of the excerpt and the comments provided by the participants
can be helpful. The correlation of scores is based on the assumption that the
participants use the scale in an identical manner for the rating task and the questionnaire
exercise. Although this is a reasonable assumption, it has not been tested.
Additionally, we assume that the equally weighted sum of scores from the ITQ
reflects immersive tendency accurately. Given the lack of internal consistency [2]
and the unexplained theoretical grounds for including items on the questionnaire,
the assumption may be violated. While it is difficult to draw conclusions about the
ITQ due to the limited number of observations, the questionnaire must be examined
and compared with other existing questionnaires, and/or new questionnaires should be
developed to assess immersive tendency.


11.6 Summary and Future Work 


The primary focus of this chapter has been to present the different perspectives 
on immersion and address the inconsistent and interchangeable usage of the term. 
The conceptualizations of immersion gathered from the literature are categorized 
and clarified using the filter model. We advocate a top-down approach to studying
immersion and have synthesized a definition from the psychological standpoint. The
definition presented below is intentionally broad and application-agnostic to aid
adaptability to different applications. Although it has been applied to a non-interactive
application in this chapter, it is applicable to interactive activities as well.


Immersion is a phenomenon experienced by an individual when they are in a 
state of deep mental involvement in which their cognitive processes (with or 
without sensory stimulation) cause a shift in their attentional state such that 
one may experience disassociation from the awareness of the physical world. 


This definition was used as the foundation for drawing distinctions between 
immersion and commonly confused terms such as envelopment and presence. An 
exploratory experiment was performed after outlining the implications for the
experimental paradigm and appraising the benefits and drawbacks of objective and
subjective measures. A rating-experiment-inspired paradigm was chosen for evaluating
immersion. The results of the experiment show that the participants were able to
discriminate among stimuli even with limited statistical power. This is an important
result as it demonstrates that the assessors were able to comprehend the task and
reflect on the overall immersion in an experience. Another important result shows
that immersion is a graded concept, which empirically supports the theoretical
conceptualizations of immersion.

The ultimate motivation for studying and evaluating immersion is to improve the
experience for the users. A key assumption in the quest to study immersion is that
positive immersive experiences are preferred by users. It is critical to test this assumption
before exploring the different avenues for future work. Efforts should focus on
validating and optimizing the experimental paradigm presented in this chapter in addition
to overcoming the limitations of the current work stated above. Although the method
was applied in the context of domestic audiovisual experiences, adapting the method 
for virtual and augmented reality applications can be beneficial for optimizing the 
general methodology. Future work should be focused on quantifying the influence 
of the physical characteristics of the audiovisual rendering systems on immersion. 
The results could then be used to improve experiences for the users. For example, 
determining the influence of audio spatialization can be helpful in designing appro- 
priate sound systems for enabling immersive experiences. The filter model described 
in Sect. 11.3 is particularly useful for establishing relationships between the phys- 
ical and the cognitive domains. Inspiration can be drawn from descriptive analysis 
techniques such as free elicitation [6] and open profiling of quality [65] to determine 
the key attributes of immersive experiences and acquire knowledge about the central 
ideas of immersion from the user’s perspective. 
Chapter 12
Augmenting Sonic Experiences Through
Haptic Feedback


Federico Fontana, Hanna Järveläinen, and Stefano Papetti 


Abstract Sonic experiences are usually considered as the result of auditory feed- 
back alone. From a psychological standpoint, however, this is true only when a 
listener is kept isolated from concurrent stimuli targeting the other senses. Such 
stimuli, in fact, may either interfere with the sonic experience if they distract the 
listener, or conversely enhance it if they convey sensations coherent with what is 
being heard. This chapter is concerned with haptic augmentations having effects 
on auditory perception, for example how different vibrotactile cues provided by an 
electronic musical instrument may affect its perceived sound quality or the playing 
experience. Results from different experiments are reviewed showing that the audi- 
tory and somatosensory channels together can produce constructive effects resulting 
in measurable perceptual enhancement. That may affect sonic dimensions ranging 
from basic auditory parameters, such as the perceived intensity of frequency com- 
ponents, up to more complex perceptions which contribute to forming our ecology 
of everyday or musical sounds. 
12.1 Introduction


During a sonic experience, humans give meaning to what is being listened to, based 
on their perception and cognition of the auditory scene. As other sensory channels 
normally convey stimuli in parallel to hearing, the human brain integrates a contin- 
uous flow of sensations while contextualizing the experience. If, on the one hand, 
vision, smell, and taste concur in describing an auditory scene, thanks to high-level 
connections involving our mental imagery [14], on the other hand, touch is often 
exposed to temporal patterns that exhibit a strong affinity with the acoustic signals 
hitting the eardrum with respect to their synchronism, amplitude, spectral content, 
and mutual localization. This similarity is evident, for instance, when a musician
plays an instrument, and more generally whenever a human action generates an
event producing sound as a (by-)product.

This chapter examines whether the somatosensory feedback resulting from that
action contributes to augmenting the sonic experience. Here, the term augmentation
embraces all sorts of enrichment that a sonic experience would benefit from through
the somatosensory channel, whether it makes a perceived sound stronger, clearer,
more vivid, meaningful, pleasant, or ecologically valid. Such a variety of effects,
affecting sound from its fundamental physical dimensions up to its semantics,
can be explained by the tight interactions that sound and vibration establish with one
another, as soon as our brain associates them both with a single event. Understanding
such interactions and their effects is the main goal of scientists who investigate the
psychophysics of auditory-tactile perception.

Perception psychologists were able to isolate the role of touch, especially dur- 
ing passive auditory tasks. Such tasks in fact lead to generally more robust design, 
control, and repeatability of the experiments. For this reason, the reference literature 
introducing this chapter deals mainly with passive touch. However, the most interest- 
ing sonic augmentations in an ecological or musical sense involve perception-action 
loops, in which the listener physically interacts with a sounding object. In the case 
of active exploration, or when a device reproduces tactile cues, the sense of touch 
conveys haptic feedback. Accordingly, our chapter will focus on effects reported by 
active listeners, as well as on sonic (either ecological or musical) experiences resulting
from passive tasks in the presence of various haptic interfaces.


12.1.1 Multisensory Processing of Touch and Audition 


Multisensory processing—the convergence of information from various sensory 
channels—happens both in early cortical stages and in high-level structures. These 
processes can either enhance or depress response relative to the most robust unisen- 
sory information. This multisensory integration benefits feature integration, object 
processing, event detection, and decision-making especially when cues are weak 
or ambiguous [16, 54] (please refer also to Chap. 10 for a bigger picture on this 
topic). There is ample evidence of integration and interaction between the senses of
hearing and touch. While somatosensory influence on higher auditory structures is 
well-known, evidence of low-level influence is more recent and increasing [30]. The 
cochlear nucleus in the brainstem responds to both somatosensory and auditory stim- 
ulation; this way somatosensory input may influence both sound lateralization and the 
suppression of self-generated sounds [10, 53]. The first cortical stages—previously 
thought to process unimodal sensory information—are now known to converge and 
sometimes process heteromodal information. The primary and the belt areas of the 
auditory cortex receive inputs from various low-level somatosensory areas, while 
fewer reports point to pathways from auditory to somatosensory areas. Higher-level 
multisensory areas that process auditory and somatosensory information include the
superior temporal cortex and the insular cortex [15].

However, much of the multisensory integration that is necessary for the
identification and localization of events takes place in the superior colliculus (SC), which is
located in the midbrain: several subcortical and primary cortical areas project audi- 
tory, somatosensory, and visual information to this area. The neurons in the SC can 
respond differently to cross-modal stimuli than to either of the respective unimodal 
stimuli. Information is integrated according to a few general principles: spatially 
and temporally coherent stimuli produce maximal enhancement, and weaker stimuli 
produce a relatively greater enhancement (inverse effectiveness) [47]. 
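
Enhancement and depression relative to the most robust unisensory response are often expressed in the multisensory literature as a percentage index. The sketch below uses that common index to illustrate inverse effectiveness; the formula is not taken from this chapter, and the function name and all response values are hypothetical.

```python
def enhancement_index(multisensory, unisensory_a, unisensory_b):
    """Percent change of the multisensory response relative to the
    stronger of the two unisensory responses.

    Positive values indicate enhancement, negative values depression.
    """
    best_unisensory = max(unisensory_a, unisensory_b)
    if best_unisensory <= 0:
        raise ValueError("the best unisensory response must be positive")
    return 100.0 * (multisensory - best_unisensory) / best_unisensory


# Hypothetical mean responses (e.g., spikes per trial) of one neuron.
weak = enhancement_index(multisensory=6, unisensory_a=2, unisensory_b=1)
strong = enhancement_index(multisensory=26, unisensory_a=20, unisensory_b=15)

print(weak)    # 200.0
print(strong)  # 30.0
```

In this toy case the weak stimuli yield the larger proportional gain even though the strong stimuli yield the larger absolute response, which is the pattern that inverse effectiveness describes.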

Similar observations have been made at the behavioral level: sounds and vibrations
have been shown to interact constructively when congruent stimuli are delivered
simultaneously [56, 57], with measurable auditory effects of somatosensory
feedback [4, 36, 37, 39, 51, 52, 61]. Here congruence is defined depending on the
experimental procedure: in general, it refers to conditions in which the multisensory
stimulus shares common spatio-temporal as well as spectral features, as if it
originated from a single source producing sounds and vibrations together. In
parallel, simultaneity refers to a stimulus pair whose acoustic and vibratory components
are rigorously constrained concerning their mutual synchronization: audio-tactile
temporal resolution is superior to that of audio-visual or visuo-tactile combinations [20].
In this regard, it must be kept in mind that hearing and touch are both very sensitive
to temporal delays, and can detect especially small latencies relative to each other.
By varying these values in the range 5–70 ms, Kaaresoja et al. were able to
change the perceived quality of virtual buttons during a clicking gesture [29]. More
generally, mutual desynchronization and/or delocalization of the acoustic and
vibratory components lead to disparate effects that must be dealt with case by case [50],
revealing the complexity of audio-tactile interactions. As this chapter focuses on
haptic feedback, we will instead describe experiments where stimuli are simultaneous
and co-localized.

Spatial co-location seems in fact somewhat less critical than temporal synchrony,
judging by the presence of audio-tactile interactions and enhancement in many
experiments where participants receive vibrotactile feedback through the hand and auditory
stimuli through headphones [31]. Nevertheless, humans have good spatial
discrimination ability between auditory and tactile stimuli: lateral angles greater than 5.3°
were detected between electrotactile stimulation at the fingertip and a sound source in an
experiment by Altinsoy [2]. (To put this in context, auditory localization blur for the
scraping sounds used as stimuli in the experiment was 3.9°.) Indeed, the seeming
failure of some studies to demonstrate spatial modulations of audio-tactile interactions
may be due to the fact that stimuli were presented at the hands or otherwise at some
distance from the head; more recently, spatial modulation effects have been
observed especially in the space close to the head [31]. However, these phenomena
are not yet thoroughly understood; note that in the peripersonal space, even unimodal
auditory localization differs from that at greater distances [6–8].

The psychophysical literature specifically dealing with the effects of touch on 
auditory perception is sparse, mostly focusing on intensity and pitch as primary 
objects of investigation. As opposed to the previously described constructive effect 
valid for multisensory cues of intensity, the interactions between auditory pitch 
and tactile frequency discrimination are more complex [5, 59]. In particular, tac- 
tile frequencies do not need simultaneity nor co-localization to affect pitch percep- 
tion [62]. As part of their study on the audio-tactile pitch and loudness interactions, 
Yau et al. found separate mechanisms for tactile influence on loudness and pitch, with 
audio-tactile loudness perception depending more on the timing of the stimuli [60]. 
In any case, pitch is perceived much more accurately through the hearing system; hence
touch in general plays no supportive role during the perception of frequency
components in an audio-tactile signal. Still, tactile frequency discrimination ability has been
ascertained [23, 55], with surprising accuracy in congenitally deaf individuals [35].
This evidence naturally leads to the question of musical sensations induced by
touch, an issue which has fascinated several scientists [49] and, hence, occupies an
important part of this chapter.

Some deaf musicians show an indisputable ability to “feel the vibrations” during
music performance, not merely for entraining with other musicians [12, 22, 25,
26], but also for sharing melody and timbre with them. This ability seems to be the
result of the long training that any good musician (including those with normal hearing)
has accumulated with all senses on their instrument [34] during a continuous
perception-action process. Such training thus refines a multisensory acuity for the instrument
quality, not limited to its sound [21, 58].

Non-musicians can also discriminate musical timbre and relative pitch intervals 
from vibrotactile cues, to some extent even without training [24, 49]. However, 
generalizing the above-mentioned higher-level phenomena to musically untrained 
individuals is not obvious [3]. Being inherently psychophysical, the summation of
auditory and tactile cues of intensity can be expected to apply to non-musicians
as well. In parallel, musical training seems to facilitate more subtle audio-tactile
synergies mediated by higher nervous system levels, such as those linking pitch
and tactile frequency recognition [11, 33]. Between these two facts, the possibility for
touch to enable normal listeners to detect frequency components that are otherwise
inaudible, due to masking or threshold effects, is yet to be systematically explored.
Table 12.1 Key characteristics of the experiments forming the chapter

Experiment | Conditions | Input gesture | Haptic feedback | Audio feedback | Results
Ball bouncing on everyday materials [13] | Passive | Finger-pad contact | Recorded stimuli | Recorded stimuli | Audio-tactile summation leads to best identification
Reproduction of target pressing forces [27] | Interactive, musical | Finger pressing | Sine wave | Sine wave | Audio-tactile summation offers best performance
Perception of musical scales [48] | Passive, musical | Hand-instrument contact | Musical scale | Masking noise | Differences between scales are discriminated
Perception of plucked strings [44] | Interactive, musical | String plucking | Guitar string | Guitar string | Haptic feedback does not increase sound quality
Piano playing [18] | Interactive, musical piano | Free playing | Piano vibration on and off | Piano | Natural vibrations alter perceived quality
Digital piano playing [19] | Interactive, musical keyboard | Free playing | Piano vibration, filtered noise | Piano | Vibrations alter perceived quality
Playing experience on a haptic surface for musical expression [41] | Interactive, musical | Free playing | None, sine wave, filtered audio, band-passed noise | Expressive/dull | Vibrations coherent with audio feedback improve perceived interface quality/playing experience


12.1.2 Chapter Outline 


In their respective interaction contexts and with different confidence levels,
the experiments chosen for this chapter share the general assumption that a sonic
experience can be influenced by somatosensory cues. Some of them (e.g., [19, 48])
contributed to shaping the musical haptics research methodology and, hence,
inevitably led to less robust conclusions. For this reason, they are more
suggestive than conclusive.

To orient readers toward the experiments that match their interests,
Table 12.1 summarizes their key characteristics. Moreover, the table labels the
experiments with gray tones classifying their dependence on specific elements. According
to this classification, the first two experiments define an abstract context which is in
principle applicable to multiple interaction contexts. The third and fourth experiments
limit these contexts respectively to musical scales and plucked strings perception.
The fifth and sixth ones further restrict the context respectively to acoustic and digital
pianos. Finally, the seventh experiment specifically targets haptic versions of sound
wave templates.

In more detail, the first experiment suggests a role of tactile frequency discrimination
in enhancing the auditory perception of near-threshold frequency components;
this role emerged during the audio-tactile identification of everyday materials from
their response to a ball hitting them [13]. Next, we present an experiment conducted
using an audio-tactile interface [40], showing that individuals performing a basic
musical gesture such as finger pressing were able to reproduce previously learnt target
forces more accurately when receiving simultaneous audio-tactile feedback instead of
auditory or tactile feedback alone [27].

The third and fourth experiments link the aforementioned effects to musical
experiences. As evidence of the power of the vibrotactile channel to deliver musical
information, we first review a test in which Western and Indian musicians categorized and
even identified musical scales from both traditions by touching the surface of a
harmonium [48]. Then, a robotic stringed instrument prototype called Keytar is described,
in which the accurate haptic rendering of its virtual strings was significantly
appreciated by users, although with no significant improvement in perceived sound
quality [44].

Conversely, a constructive effect was measured in pianists playing an acoustic
piano whose natural vibrations could be switched on and off, thanks to the particular
engineering of the keyboard: in this case, the inclusion of vibrotactile feedback resulted
in a measurable improvement of the instrument sound quality [18]. A similar effect
was measured in musicians playing an actuated digital piano when this instrument
reproduced vibrations recorded on a real piano [19].

Finally, using a force-sensitive haptic surface for musical expression which con- 
trolled a synthesizer, the effect of various vibration types on perceived quality 
attributes and the playing experience was assessed [41]. 


12.2 Ball Bouncing on Everyday Materials 


Two experiments [13] studied the role of impact sounds and vibrations for the sub- 
jective classification of three flat objects, which were respectively made of wood, 
plastic, and metal—see Fig. 12.1. 

The task consisted of feeling an actuated surface and listening through headphones 
to the recorded feedback of a ping-pong ball hitting such objects (Fig. 12.2, left), 
after they had been experienced during a training task (Fig. 12.2, right). 

In Experiment 1, sounds and vibrations were recorded by keeping the objects in 
mechanical isolation. In Experiment 2, recordings were taken while the same objects 
stood on a table, causing their resonances to fade faster due to mechanical coupling 
with the support. Twenty-five subjects, aged between 23 and 61 years (M = 32.1, 
SD = 10.1), participated in Experiment 1, and twenty-seven (21-54 years old; M = 
29.0, SD = 6.8) in Experiment 2. Eight subjects participated in both experiments. 
Roughly one-third of the participants were female. In terms of musical training, 
participants were not screened, and they reflect the general population average.

Fig. 12.1 Materials used in the experiment. Left: wood. Center: plastic. Right: metal


Fig. 12.2 Experimental tasks. Left: perceptual task. Right: training task

Fig. 12.3 Boxplots and mean proportions correct with SE bars for all condition combinations.
Left: Experiment 1. Right: Experiment 2


As a general result, in both experiments tactile identification was less accurate 
than auditory identification. In parallel, the bimodal (i.e., simultaneously auditory 
and tactile) identification ranked significantly better in both experiments, providing 
evidence of support from touch to auditory material identification (Fig. 12.3). 

This conclusion was not contradicted by a control experiment, in which partici- 
pants were asked to identify the materials from real bounces as during the training 
shown in Fig. 12.2, right.

Between Experiments 1 and 2, some interesting differences are observed between
materials. In Experiment 1, metal was identified from auditory cues almost perfectly
(the difference from both plastic and wood was significant in multiple comparisons
following a significant Friedman test: AuditoryWood-AuditoryMetal: Z = 4.3,
Bonferroni-corrected p < 0.01; AuditoryPlastic-AuditoryMetal: Z = 3.4, p < 0.01).
In contrast, in Experiment 2, the identification of metal was the poorest
of the three materials. In a two-way repeated-measures ANOVA with Greenhouse-Geisser
correction for violations of sphericity, a significant main effect of Material was
detected (F(1.61, 41.9) = 16.3, p < 0.001). The 95% confidence intervals of the three
materials show a partial overlap between Plastic (0.51-0.64) and Metal (0.42-0.57),
whereas the 95% CI for Wood (0.65-0.78) lies entirely above their combined range. As
the main difference between the stimuli in Experiments 1 and 2 was the length of the
decay, it seems that the longer decay in Experiment 1 was an important identification
cue, especially for metal.
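
The reading of the Experiment 2 confidence intervals can be checked with a few lines of arithmetic. The sketch below is only an illustration: the interval bounds are the ones quoted in the text, while the function names are hypothetical and the intervals are treated as closed.

```python
# 95% confidence intervals for proportion correct in Experiment 2,
# copied from the text. Function names are illustrative only.
CI = {
    "Wood": (0.65, 0.78),
    "Plastic": (0.51, 0.64),
    "Metal": (0.42, 0.57),
}


def overlap(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_pairs(intervals):
    """List every pair of labels whose intervals overlap, in alphabetical order."""
    names = sorted(intervals)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if overlap(intervals[x], intervals[y])
    ]


if __name__ == "__main__":
    print(overlapping_pairs(CI))
```

Only the Metal and Plastic intervals overlap; the lower bound for Wood (0.65) lies just above the upper bound for Plastic (0.64).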

Importantly for this chapter, the ability of our subjects to maximize their
identification accuracy when using sounds and vibrations together suggests that
audio-tactile summation may work in all individuals as soon as they have acquired solid
knowledge of a multisensory event belonging to everyday experience, and not only
if they have accumulated particular audio-tactile skills, e.g., by practicing for a long
time with a musical instrument. This conclusion was reinforced by a further test, part
of the same research, where incongruent bimodal stimuli were prepared by pairing
sounds and vibrations from two different materials. This test in fact suggested that
tactile feedback, despite its limited ability to convey timbre, became progressively
more relevant as the auditory channel, when faced with incongruent materials, gave up
its leading role while remaining supportive of cross-modal perception.


12.3 Reproduction of Target Pressing Forces 


An effect of haptic feedback on the control of finger-pressing force has been shown 
in the literature (e.g., [1, 28]). The present setup [27] approaches a musical task 
in that it measures memorized force targets in the presence of both auditory and 
vibrotactile feedback. The experiment was carried out by means of a tabletop device 
capable of measuring normal force while displaying vibrotactile feedback at its top 
panel (Fig. 12.4). 

Fig. 12.4 The interface used in the experiment for recording finger-pressing force and
providing vibrotactile feedback

To simulate the haptic exchange taking place when playing acoustic or electroacoustic
instruments, where musicians learn the response of the instrument and then perform
by relying on kinesthetic memory [38], participants first learned three target forces
during a training phase, without additional feedback. Those targets were chosen
empirically to represent low, medium, and high pressing forces, within the data
resolution of the interface (10-bit, corresponding to the 0-1023 range) and without
anchoring them to corresponding values in newtons: the low target was set to 400,
the medium one to 650, and the high target to 850. A double-sided window of 50 units
was considered around each target as the acceptance range. The task was then to
reproduce such forces from memory under four feedback conditions: no feedback (N),
auditory only (A), vibrotactile only (T), and auditory and vibrotactile together (AT).
When participants believed they had reached the requested target, they had to press
an “OK” button with their free hand while maintaining the pressing force
on the touch panel.
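
As an illustration of the target scheme described above, the following sketch checks a raw 10-bit reading against a target and computes a relative error. It rests on two assumptions that the text does not settle: the "double-sided window of 50 units" is read as 50 units on each side of the target, and relative error is taken as the absolute deviation divided by the target. All names are hypothetical.

```python
# Target forces in raw 10-bit sensor units, as given in the text.
TARGETS = {"low": 400, "medium": 650, "high": 850}
SENSOR_MAX = 1023   # 10-bit resolution
HALF_WINDOW = 50    # ASSUMPTION: 50 units on each side of the target


def is_accepted(reading, level, half_window=HALF_WINDOW):
    """Return True if a raw reading falls inside the acceptance range."""
    if not 0 <= reading <= SENSOR_MAX:
        raise ValueError("reading outside the 10-bit range")
    return abs(reading - TARGETS[level]) <= half_window


def relative_error(reading, level):
    """ASSUMED definition: absolute deviation divided by the target value."""
    target = TARGETS[level]
    return abs(reading - target) / target


if __name__ == "__main__":
    for reading in (430, 460):
        print(reading, is_accepted(reading, "low"), relative_error(reading, "low"))
```

Because the targets are raw sensor values rather than newtons, any such error measure is only meaningful relative to this specific device.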

For the sake of simplicity, a sinusoidal signal was chosen for rendering both 
auditory and tactile feedback, whose amplitude varied proportionally to the applied 
pressing force—thus implementing a gesture mapping commonly found in musical 
practice. The maximum intensity of vibrotactile stimuli was empirically set to the 
highest level that could be reproduced without perceivable distortion. The frequency 
of the sine wave was set to 200 Hz so as to maximize the produced vibrotactile 
sensation [55]. 
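
The mapping from pressing force to feedback amplitude can be sketched as follows. This is a minimal illustration and not the authors' implementation: the 200 Hz frequency and the proportional mapping come from the text, whereas the sample rate, the normalization to the 10-bit range, and all names are assumptions.

```python
import math

FREQ_HZ = 200.0        # sine frequency stated in the text
SAMPLE_RATE = 44100    # ASSUMPTION: a common audio sample rate
FORCE_MAX = 1023       # 10-bit force reading


def feedback_block(force, n_samples, start_sample=0, max_amplitude=1.0):
    """Return n_samples of a 200 Hz sine whose amplitude is proportional to force.

    start_sample keeps the phase continuous across consecutive blocks.
    """
    clipped = min(max(force, 0), FORCE_MAX)
    gain = max_amplitude * clipped / FORCE_MAX
    step = 2.0 * math.pi * FREQ_HZ / SAMPLE_RATE
    return [gain * math.sin(step * (start_sample + n)) for n in range(n_samples)]


if __name__ == "__main__":
    block = feedback_block(force=650, n_samples=441)
    print(max(abs(s) for s in block))
```

The same signal could drive both the audio output and the vibrotactile actuator, with `max_amplitude` set per channel to the highest undistorted level.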

The test followed a two-factor within-subjects design, where each participant was
tested under each combination of feedback condition and target force (4 × 3 = 12
combinations). All combinations were repeated 10 times, resulting in 120 trials that
were presented in randomized order. Fourteen people (average age 33) participated in
the experiment: five of them were pianists, five other musicians, and four
non-musicians.¹
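
The trial count follows directly from the design: 4 feedback conditions times 3 target forces times 10 repetitions. A possible way to build such a trial list is sketched below; the full shuffle across all 120 trials is an assumption, since the text only states that the order was randomized, and the function name is hypothetical.

```python
import itertools
import random

FEEDBACK = ["N", "A", "T", "AT"]        # feedback conditions from the text
TARGETS = ["low", "medium", "high"]     # target force levels from the text
REPETITIONS = 10


def build_trials(seed=None):
    """Return a shuffled list of (feedback, target) trials.

    ASSUMPTION: trials are shuffled as a whole rather than in blocks.
    """
    rng = random.Random(seed)
    combinations = list(itertools.product(FEEDBACK, TARGETS))
    trials = combinations * REPETITIONS
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    trials = build_trials(seed=1)
    print(len(trials), trials[:3])
```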

Data analysis² showed a significant main effect of the feedback factor (F(3,143) =
16, p < 0.0001). The effect of target force level was not significant (F(2,143) = 0.7,
p = 0.52); however, the interaction “feedback × target level” was significant
(F(6,143) = 6.0, p < 0.0001).

The interaction plots in Fig. 12.5 show that, for the low target force, mean errors
are much smaller in the presence of auditory (A) or audio-tactile (AT) feedback, and
somewhat smaller with tactile-only feedback (T) than with no feedback (N). For the
medium target force, mean errors decrease with no feedback (showing that the
task becomes easier for higher forces) and with tactile-only feedback
(T), whereas with auditory or audio-tactile feedback (A, AT) they do not change
much from the low target force. For the high target force, however, the results are
almost equivalent in all feedback conditions.

Fig. 12.5 Interaction plots. Top panel: mean relative errors at the three target forces, presented
for each feedback condition. Bottom panel: mean relative errors at the four feedback conditions,
presented for each target force level

¹ The relatively low number of participating musicians and non-musicians reflects the
exploratory character of tactile experiments with pianists as far back as a decade ago. Such
experiments have since consolidated into more robust methodologies, including the
participation of more musicians when necessary (see, e.g., Sect. 12.6).

² Performed using the aligned rank transform, a nonparametric alternative to factorial
within-subjects analysis of variance.

The results generally show that the addition of vibrations to auditory feedback may 
improve performance in musical finger-pressing tasks, enabling subjects to achieve 
memorized target forces with higher accuracy.


12.4 Vibrotactile Recognition of Traditional Musical Scales


The harmonium, visible in Fig. 12.6 (left), is played in both Western and Oriental 
music using scales that belong to the respective tradition. Musicians and also listeners 
with a normal understanding of music immediately recognize the ethnicity of a scale. 
In fact, the human ear is especially accurate in assessing the intervals existing between 
the fundamental frequencies of musical notes. 

Does a haptic counterpart of this scale recognition ability exist, as the result of a
tactile frequency identification process that musicians have internalized as part of
their practice on an instrument? And, if recognition does not occur, would they at
least be able to discriminate between different ethnicities? If either answer were
positive, then musical vibrations would prove to be active carriers of spectral
information capable of supporting, or even substituting, an especially important
component of the musical message coming from an instrument.

Western and Indian notes have fundamental frequencies that in general do not
match; furthermore, the intervals between notes differ depending on the scale. As
a result, clearly audible discrepancies exist between Western and Indian musical
scales, as well as between different scales belonging to the same ethnicity.

The stimuli for the experiment [48] consisted of two Western (C natural and 
A minor) and two Indian (Raag Bhairav and Raag Yaman-Kalyan) scales played 
on the harmonium in the setup of Fig. 12.6 (right) by an Indian performer living 
in Europe. After listening to the four scales without touching the instrument during
a training session, participants in a tactile recognition test sat on the left
side of the same setup with their hands on the harmonium. At every trial, they were
exposed to a train of vibrations corresponding to the sequence of notes belonging
to a scale played by the performer. At the end of it, they had to decide whether the
vibration conveyed a Western or an Indian scale, and which of the two scales of that
tradition it was.


Fig. 12.6 Left: the harmonium (labeled parts: bellows, body, main stops). Right: experimental setup

Table 12.2 Individual subjective performance

| Subject typology | Western participants: recognition of tradition | Western participants: recognition of scale | Indian participants: recognition of tradition | Indian participants: recognition of scale |
|---|---|---|---|---|
| A (teacher of music) | 7/16 | 4/16 | 12/16 | 12/16 |
| B (teacher of music) | 11/16 | 9/16 | 13/16 | 11/16 |
| C (amateur musician) | 15/16 | 12/16 | 13/16 | 8/16 |
| D (professional musician) | 16/16 | 12/16 | 10/16 | 7/16 |
| E (professional musician) | 13/16 | 11/16 | 10/16 | 5/16 |
| Overall recognition | 62/80 | 48/80 | 58/80 | 43/80 |
| Overall percentage | 77.5% | 60% | 72.5% | 53.75% |
| Chance percentage | 50% | 25% | 50% | 25% |


During the test, they wore headphones emitting masking noise and could not
observe the playing action, thanks to a panel standing across the harmonium body
that prevented the performer and participant from seeing each other.

The test was performed by a group of native Italians and then repeated in India.
The two groups of participants, identical in number, were selected so as to have
comparable levels of musical knowledge and performing skills. Results are listed in Table
12.2: they reveal the ability of both groups to recognize the ethnic origin, with no
significant differences between groups. Limited to specific subgroups, i.e., Western
performers and Indian music teachers, the specific scale was recognized as well. The
surprisingly high performance shown by our participants suggests the existence of a
well-developed tactile memory for tones and/or note scales in musicians, a possible
result of musical instrument training. However, support during the task from nearly
masked auditory cues of pitch, either bypassing the headphone insulation or traveling
from the hands to the cochlea through bone conduction, could not in principle be excluded.
Similarly, scale-dependent temporal nuances biasing the recognition of the stimuli
might have been unconsciously introduced by the performer during playing. In spite
of its limited control, this experiment nevertheless represented an interesting starting
point for the study of the role of touch in musical scale recognition.
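
The overall percentages in Table 12.2 follow from the pooled counts (5 participants with 16 trials each, i.e., 80 trials per group); for instance, 62/80 corresponds to 77.5%. The sketch below recomputes them and, purely as an illustration, evaluates an exact one-sided binomial tail against the chance level. Pooling trials across participants ignores individual differences, so this is not the analysis reported in [48]; all names are hypothetical.

```python
from math import comb

TRIALS = 80  # 5 participants x 16 trials per group

# (correct answers, chance probability), with counts taken from Table 12.2
RESULTS = {
    ("Western", "tradition"): (62, 0.50),
    ("Western", "scale"): (48, 0.25),
    ("Indian", "tradition"): (58, 0.50),
    ("Indian", "scale"): (43, 0.25),
}


def percent_correct(correct, total=TRIALS):
    """Percentage of correct answers."""
    return 100.0 * correct / total


def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    for (group, task), (correct, chance) in RESULTS.items():
        pct = percent_correct(correct)
        tail = binomial_tail(correct, TRIALS, chance)
        print(f"{group:8s} {task:10s} {pct:6.2f}%  P(X >= k | chance) = {tail:.2e}")
```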


12.5 Perception of Plucked Strings 


Keytar is a plucked-string instrument interface [17]. Its software was developed 
within the Unity3D development engine. While running on a PC, Keytar provides 
real-time auditory, visual, and haptic feedback to the player who controls a virtual 
plectrum through a Phantom Omni robotic arm with one hand, while selecting notes 
and chords with the other hand (see Fig. 12.7, left). An accurate haptic rendering 
of the interaction point was made possible by modeling each string as a queue of
short cylinders with alternating radius, and then by characterizing the contact of
the plectrum using physical parameters which, due to the elastic behavior of the
string, fall within the operating range of the Phantom Omni (see Fig. 12.7, right).
This way, the robotic arm reproduces not only the elastic response of the plucked
strings, but also some fine-grained dynamic textures arising between the colliding
plectrum and the vibrating string. The sensation of rubbing the string during plucking
is further enhanced by a realistic noise of frictional contacts coming from the
servomechanisms of the robotic arm, as they are continuously switched on and off by
the collision detection software module. The overall virtual environment provided an
especially convincing reproduction of string plucking [45].

Fig. 12.7 Left: Keytar. Right: detail of the plectrum-string interaction point

In a virtual reality experiment [44], twenty-nine participants with an average of
8.2 years (SD = 8.3) of regular practice on a musical instrument were asked to first
pluck the strings of a real guitar, and then to wear an Oculus Rift CV1 headset
displaying an electric guitar and a plectrum in a nondescript virtual room. Twenty-one
of these participants reported being able to play one or more stringed
instruments. Interaction with the plectrum was made possible by the robotic arm
controlled by Keytar. Furthermore, the collision detection module also controlled a
vibrotactile actuator placed below the Phantom Omni. This active stand was used to
produce additional vibrations independently of the kinesthetic feedback. On this
setup, a within-subjects study compared four different haptic conditions during
plucking: no feedback (N), force only (F), vibration only (V), and force and vibration
together (FV).

Each participant was exposed to every condition, in randomized order, for
approximately 20 minutes each. In every condition, all six strings were first plucked
twice in a randomized order, guided by a visual marker emphasizing the string to
pluck; then, participants were encouraged to interact freely by both plucking each
string individually and strumming the entire string set. When one condition was
completely tested, each participant evaluated the following metrics on a Likert scale
(see Fig. 12.8): overall perceptual similarity with the real instrument (from
completely different to identical); stiffness similarity between virtual and real
strings (from much lower to much higher); overall realism of the virtual instrument
(from strong disagreement to strong agreement); touch realism of the virtual strings
(from strong disagreement to strong agreement); effects of haptic cues on sound
realism. At the end of the test each participant was additionally asked to choose
his/her preferred condition. Finally, the errors made by plucking a wrong string
instead of the visually marked one during the part of the test involving individual
strings were logged.

Fig. 12.8 Keytar: experimental results. Panel (b): stiffness relative to physical strings

Results suggest the existence of significant effects of haptic feedback on the 
perceived realism of the strings. Further considerations can be drawn from the specific 
histograms [46]. By contrast, as can be seen from the left histogram below in Fig. 12.8, 
no effects on sound realism were measured. The lesson to take home from this 
experiment, hence, is that increasing the haptic realism of a virtual musical instrument 
in principle has no effects on its perceived auditory quality. 


12.6 Piano Playing 


A different lesson was instead learnt from an experiment in which the realism of
the interaction with the musical instrument, in this case a piano, was pushed to its
limit [18]. The piano keyboard in fact offers a controlled experimental setting, as
the performer can only hit and then release one or more keys with one or more
fingers while the rest of their body is disconnected from the instrument. This setting
made it possible to design a task in which auditory and haptic feedback could be
delivered separately and independently. Furthermore, the intensity of both feedback
channels is a reliable function of the key velocity which, in turn, is driven by the
pianist’s finger. Under these experimental premises, Yamaha’s Disklavier pianos in
particular offer two specific advantages: first, they can both record and mechanically
reproduce the action of a pianist on all keys; second, they can be automatically
switched between normal operation and a silent mode. When this mode is set, all strings
are decoupled from the respective key hammers so that the instrument produces no sound,
while conveying the same haptic feedback as when the performer also hears
the instrument.

The group of participants was split into two independent subgroups. One subgroup
performed on a grand Disklavier model DC3 M4 (in Padova, Italy), the other on an
upright model DU1A (in Zurich, Switzerland). During the tasks, the acoustic and
silent modes were randomly switched across trials, letting the participants receive
either natural or no steady vibrations from the keys after the initial percussive event.
In both configurations, participants received the same auditory feedback via insulated
headphones, consisting of piano sounds synthesized by Modartt Pianoteq 4.5 digital
piano software, which was set to simulate a grand or an upright piano and was
driven in real time by the respective Disklavier via the Musical Instrument Digital
Interface (MIDI). The synthetic sounds were equalized so as to match those of the
corresponding piano using a KEMAR mannequin, visible in Fig. 12.9 (left), where the
setup is shown during the calibration procedure. Figure 12.9 (right) shows a typical
train of vibrations reaching the pianist’s finger when the piano was operating in
acoustic mode: the initial percussion event preceding the vibrations coming from the
strings is evident in this figure.

Participants performed first a playing task and then a rating task. The former is 
relevant for this chapter. Three note ranges were considered separately across the 
keyboard, labeled low (keys below D3), mid (keys between D3 and A5), and high 
(keys above A5). Participants could play freely, within one range at a time, to compare 
the quality of the instrument in the presence and absence of string vibrations following 
the initial percussive events. Twenty-five professional pianists, mostly classical and 
a few jazz, took part in the tests: 15 on the upright and 10 on the grand piano (the
slight imbalance in group sizes was due to differing ease of recruitment in the two
locations). Their average age was 27 years and their average piano experience was
15 years. Using a manual control, they could switch at their convenience between
two setups, X and Y, associated with the silent and acoustic modes of the Disklavier.
The difference between the two setups was not explained to them.

Fig. 12.9 Left: setup calibration. Right: acceleration signal measured on the key surface (note A2;
MIDI velocity equal to 12; grand piano)

The task was to compare the setups on a Likert scale (from “X much better than Y”
to “Y much better than X”) with respect to the following attributes: dynamic range,
loudness, richness, naturalness, and preference. The first four were rated separately
in the low, mid, and high ranges, while the preference rating was given considering
the entire keyboard. Participants were given definitions of the attributes and informed
that dynamic range, loudness, and richness were mainly related to sound, whereas
naturalness and preference could also be related to touch. A laptop placed next
to the piano displayed a set of sliders that pianists could access at any moment
to rate these attributes.
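
The three keyboard ranges used in the task can be expressed in terms of MIDI note numbers. The sketch below assumes the common convention in which middle C is C4 = 60 (so that D3 = 50 and A5 = 81) and treats both boundary keys as part of the mid range; neither choice is stated in the text, and the function name is hypothetical.

```python
# ASSUMPTION: MIDI note numbers with middle C = C4 = 60.
D3 = 50
A5 = 81
PIANO_LOWEST = 21    # A0 on an 88-key piano under the same convention
PIANO_HIGHEST = 108  # C8


def key_range(midi_note):
    """Classify a piano key as 'low', 'mid', or 'high'.

    ASSUMPTION: the boundary keys D3 and A5 belong to the mid range.
    """
    if not PIANO_LOWEST <= midi_note <= PIANO_HIGHEST:
        raise ValueError("note outside the 88-key piano range")
    if midi_note < D3:
        return "low"
    if midi_note <= A5:
        return "mid"
    return "high"


if __name__ == "__main__":
    # A2 (MIDI 45) is the note shown in Fig. 12.9.
    for note in (45, 50, 81, 82):
        print(note, key_range(note))
```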

Results are shown in Figs. 12.10 and 12.11, suggesting a general preference for the
vibrating mode. Since this preference was not explicitly linked to a specific attribute,
a principal component analysis was performed: two principal components, PC1 and PC2,
accounted for 80% of the variance. PC1 had the highest positive correlations with
richness, naturalness, and preference; PC2, less powerful, was associated with dynamic
range and loudness, which conversely decrease as naturalness and preference increase.

Analysis of Lin concordance correlation coefficients revealed a subgroup of seven
subjects whose inter-individual consistency was negative. It was observed that most 
of them belonged to the group of five subjects who gave a negative preference rat- 
ing. Therefore, participants were segmented a posteriori based on a positive versus 
negative preference rating. As seen in Fig. 12.11, the negative group differs from 
the majority of participants in that their ratings are negative on both principal com- 
ponents; in fact, while both groups gave rather similar ratings for dynamic range 
and loudness, their mean ratings for richness, naturalness, and preference are nearly 
opposite to each other. The conclusion was that approximately 80% of the partic- 
ipants preferred the vibrating setup and perceived higher naturalness and richness 
from it. Why the remaining 20% did not perceive any benefits from vibrations could
not be thoroughly explained; however, that group included two subjects who performed
significantly below average in a vibration detection experiment related to this
study. Notably, the negative group also included some jazz pianists. They reported
performing frequently in small ensembles where digital stage pianos are used, which 
lack the natural vibrotactile feedback found on acoustic pianos. 
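
For reference, Lin's concordance correlation coefficient between two rating vectors x and y is 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), where variances and covariance are computed with a 1/n normalization. The sketch below implements this formula; the function name and the example ratings are hypothetical and are not data from the study.

```python
def concordance(x, y):
    """Lin's concordance correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("coefficient undefined for identical constant sequences")
    return 2 * cov_xy / denominator


if __name__ == "__main__":
    # Made-up ratings of the same attributes by two raters (illustration only).
    rater_1 = [1.5, 0.5, 2.0, 1.0]
    rater_2 = [-1.0, 0.0, -2.0, -0.5]
    print(concordance(rater_1, rater_2))  # negative: the two raters disagree
```

Unlike Pearson correlation, the coefficient also penalizes differences in mean and scale, so it reaches 1 only when the two sets of ratings coincide.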

At any rate, after completing the test in Zurich, the experimenter asked each 
participant what may have caused the difference between the setups: Interestingly, 
only 1 out of 15 participants could pinpoint vibrations. Thus, while the participants 
generally preferred the vibrating setup, they were not actively aware of vibrations. 
Their unawareness testifies to the especially high level of cross-modal integration 
that piano sounds and vibrations achieve in a real instrument. 


12.7 Digital Piano Playing 


An effect related to what was observed on acoustic pianos was found to play
a role with digital pianos [19]. Since electronic instruments do not vibrate except
for possible mechanical perturbations coming from the internal speakers, potential
additional effects of artificial vibratory feedback on perceived instrument quality,
precision in timing, and dynamic performance were investigated. Preparing the setup
required disassembling a digital piano keyboard and then attaching two vibrotactile
actuators (Fig. 12.12, right) to a stiff wooden panel that was firmly screwed below
its keybed (Fig. 12.12, left).

Fig. 12.12 Left: experimental setup. Right: transducer conveying vibrations to the keyboard

These actuators conveyed stimuli that had previously been acquired from an acous- 
tic piano. In parallel, binaurally recorded tones were reproduced using headphones. 
Such tones and vibrations had previously been calibrated to have an intensity equal 
to that measured on the finger and ears of a pianist performing on a Disklavier grand 
piano, in the same fashion as the experiment in Sect. 12.6. In particular, calibration 
is required to equalize the vibration signals in order to avoid unrealistic resonance 
peaks on the digital keyboard for certain played notes. 

Eleven pianists, five females and six males, participated in the experiment. Their 
average age was 26 years, and their average piano playing experience was 8 years 
after reaching the conservatory level. Two participants were jazz pianists. Audio- 
tactile stimuli were produced at runtime: the digital keyboard in fact sent MIDI mes- 
sages to a computer running Modartt Pianoteq 4.5 piano synthesizer and, in parallel, 
Native Instruments Kontakt 5 sampler in series with MeldaProduction MEqualizer 
parametric equalizer for playing back the corresponding vibration samples. 

Perceived instrument quality was assessed by feeding the digital keyboard respec- 
tively with (A) no vibrations, (B) grand piano vibrations, (C) grand piano vibrations 
with 9 dB boost, and (D) synthetic vibrations. By contrast, the sound synthesis 
parameters were kept constant throughout the experiment. Pianists were asked to 
play freely while assessing the experience on five attribute rating scales: Dynamic 
control, Richness, Engagement, Naturalness, and General preference. During playing,
they could switch at their convenience between two unknown setups, α and β:
the former always corresponded to A, whereas the latter could randomly
correspond to B, C, or D. The assessment was conducted by rating β relative to
α during 10 minutes of piano performance per condition, for a session that hence
lasted half an hour. During each assessment, participants could rate every attribute
at any time by pointing to the respective virtual slider and setting a level with a
mouse click on a graphical user interface displayed by a laptop computer within reach.
Each slider exposed a continuous Comparison Category Rating scale ranging from
−3 (“β much worse than α”) to +3 (“β much better than α”). Once the quality rating
of the keyboard was over, each participant spent another half an hour on the
remaining two tests, assessing precision in timing as well as dynamic
performance.

Results show that the augmented setups were generally preferred, with an empha- 
sis on boosted vibrations (Fig. 12.13). Again, heterogeneity was observed in the 
data, as might be expected due to the high degree of variability in the inter-individual 
agreement scores. A k-means clustering algorithm was used to segment the sub- 
jects a posteriori into two classes, according to their opinion on General preference. 
Eight subjects were classified into a “positive” group and the remaining three into a 
“negative” group. The results of the respective groups are presented in Fig. 12.14. 
A difference of opinion is evident: The median ratings for the preferred setup C are 
nearly +2 in the positive group and −1.5 in the negative group for General preference.
In the positive group, the median was positive in all cases except for Naturalness in 
D, whereas in the negative group, the median was positive only for Dynamic control 
in B. 
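
The a posteriori segmentation can be illustrated with a two-cluster k-means on a single rating per subject. The sketch below is a generic one-dimensional k-means, not the authors' code: the initialization at the minimum and maximum value is an assumption, and the eleven ratings are invented solely so that the split reproduces the reported group sizes (eight and three).

```python
def kmeans_1d(values, iterations=100):
    """Two-cluster k-means on scalars.

    ASSUMPTION: centers are initialized at the minimum and maximum value.
    Returns (centers, groups), with groups[0] the lower cluster.
    """
    centers = [min(values), max(values)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


if __name__ == "__main__":
    # HYPOTHETICAL General preference ratings for 11 subjects (not study data).
    ratings = [2.1, 1.8, 2.4, 1.2, 0.9, 1.6, 2.0, 1.4, -1.5, -1.1, -2.0]
    centers, (negative, positive) = kmeans_1d(ratings)
    print(centers, len(negative), len(positive))
```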

Similar to what was observed in Sect. 12.6 while experimenting with the acoustic 
piano, low concordance between pianists exposed to vibration suggests that intra- 
and inter-individual consistency is an issue also while playing a digital piano. By 
contrast, no effect was observed on timing or dynamics accuracy in the performance
tests. Taken together, these observations suggest that vibrations do unconsciously
influence the perceived quality of a keyboard instrument, although in a direction
that depends on the performer’s previous multisensory experience of a specific
instrument. Hence, augmenting a digital piano with the vibrations of an acoustic
piano might not increase the sense of quality if the performer has mostly played a
digital (i.e., non-vibrating) keyboard. In parallel, haptic augmentation neither
improves nor disrupts key aspects of piano performance such as timing and
dynamic control. 

Fig. 12.14 Differences in quality ratings between the positive (left) and negative (right) groups 
formed by a posteriori segmentation. Boxplot presenting median and quartile for each attribute scale 
and vibration condition 


12.8 Playing Experience on a Haptic Surface for Musical 
Expression 


A multi-touch force-sensitive surface for musical expression was equipped with 
multi-point localized vibrotactile feedback, resulting in the HSoundplane haptic 
interface [43] shown in Fig. 12.15. A subjective assessment was conducted using 
the HSoundplane, which measured how the presence and type of vibration affect the 
perceived quality of the device, as well as various attributes related to the playing 
experience [41]. 


Fig. 12.15 The experimental setting for the HSoundplane experiment 


12.8.1 Design 


Two clearly distinct sound presets were tested, each with three vibrotactile feedback 
strategies. 

The pitch of the audio feedback, ranging from A2 (f0 = 110 Hz) to D5 (f0 = 
587.33 Hz), was controlled along the x-axis. The two sound presets offered were: 


Sound 1—A sawtooth wave filtered by a resonant low-pass and modulated by a vib- 
rato effect (i.e., amplitude and pitch modulation). A markedly expressive setting, 
responding to subtleties and nuances in the performer’s gesture. 
y-axis control: Vibrato intensity is controlled along the y-axis, from no-vibrato 
(bottom) to strong vibrato (top). 
z-axis control: The filter cutoff frequency is controlled by the applied press- 
ing force (i.e., higher force maps to brighter sound), and so is the sound level 
(i.e., higher force maps to louder sound). 

Sound 2—A simple sine wave is added with noise depending on the location on the 
y-axis. A setting offering a rather limited sonic palette and no amplitude dynamics. 
y-axis control: Moving upwards adds white noise of increasing amplitude, filtered 
by a resonant band-pass. The filter’s center frequency follows the pitch of the 
respective tone. 
z-axis control: Pressing force data are ignored, resulting in fixed intensity. 


The different degrees of variability and expressive potential of the two sound settings 
allowed us to investigate whether the possible effect depends on audio feedback 
characteristics. All sounds were processed by a reverb effect so as to make the 
playing experience more acoustic-like. Sound was provided to the participants by 
means of closed-back headphones (Beyerdynamic DT 770 Pro). Audio examples of 
the two sound types are made available online,³ demonstrating C3, C4, and C5 tones 
modulated along the y- and z-axes. 
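
As a rough illustration of the x-axis pitch control described above, the following sketch maps a horizontal position to a fundamental frequency between A2 and D5. It assumes a position normalized to [0, 1] and a continuous exponential (equal-tempered) mapping; neither detail is stated in the text, and the function names are hypothetical.

```python
import math

F_MIN = 110.0    # A2, lower end of the pitch range reported in the text (Hz)
F_MAX = 587.33   # D5, upper end of the pitch range reported in the text (Hz)


def x_to_frequency(x: float) -> float:
    """Map a normalized x position in [0, 1] to a fundamental frequency in Hz.

    Assumption: pitch varies exponentially with position, so equal steps
    in x give equal musical intervals. The actual device mapping may differ.
    """
    x = min(max(x, 0.0), 1.0)  # clamp out-of-range positions
    return F_MIN * (F_MAX / F_MIN) ** x


def semitones_above_a2(freq: float) -> float:
    """Distance from A2 in equal-tempered semitones."""
    return 12.0 * math.log2(freq / F_MIN)


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0):
        f = x_to_frequency(x)
        print(f"x = {x:.1f}: {f:.2f} Hz, "
              f"{semitones_above_a2(f):.2f} semitones above A2")
```

Under this assumed mapping the playable range spans about 29 semitones, i.e., two octaves and a fourth.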

Before being routed to the actuators layer, vibration signals were filtered in the 
10–500 Hz range by a 10th-order band-pass filter, so as to optimize the actuators’
efficiency and consequently the vibratory response of the device, as well as to minimize 
sound leakage. Any residual sound spillage produced by the actuators was taken 
care of by the closed-back headphones carrying auditory feedback. Three vibrotac- 
tile strategies were implemented: 


Sine—Pure sinusoidal signals, whose pitch follows the fundamental of the played tones 
(f0 within 110–587.33 Hz), and whose amplitude is controlled by the intensity 
of the pressing forces. By focusing vibratory energy at a single frequency com- 
ponent, this setting aimed at producing sharp vibrotactile feedback. 

Audio—The same sounds generated by the HSoundplane used to render vibration: the 
audio signals are also routed to the actuators layer. Vibration signals thus share 
the same spectrum (within the 10–500 Hz pass-band) and dynamics of the related 
sound. This approach ensured the highest coherence between musical output and 
tactile feedback, mimicking what occurs on acoustic musical instruments, where 
the source of vibration coincides with that of sound. 


3 https://tinyurl.com/HS-sounds. 

Noise—A white noise signal of fixed amplitude. This setting produced vibrotactile 
feedback generally uncorrelated with the auditory one, ignoring any spectral and 
amplitude cues possibly conveyed by it. The only exception is with Sound 2 and 
high y-axis values, which resulted in a similar noisy signal. 
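
The Sine and Noise strategies listed above can be sketched as simple signal generators. This is only an illustration under stated assumptions: the sample rate, the normalization of pressing force to [0, 1], the fixed noise amplitude, and all function names are hypothetical, and the 10–500 Hz band-pass stage is left out. The Audio strategy is not shown, since it simply reuses the synthesized audio signal.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz; hypothetical value chosen for the example


def sine_vibration(f0, force, duration, sample_rate=SAMPLE_RATE):
    """Sine strategy: frequency follows the tone's fundamental f0 (Hz),
    amplitude follows the pressing force (assumed normalized to [0, 1])."""
    n = round(duration * sample_rate)
    amp = min(max(force, 0.0), 1.0)
    return [amp * math.sin(2.0 * math.pi * f0 * i / sample_rate)
            for i in range(n)]


def noise_vibration(duration, amplitude=0.5, sample_rate=SAMPLE_RATE, seed=None):
    """Noise strategy: white noise of fixed amplitude, independent of
    pitch and pressing force (uncorrelated uniform samples)."""
    rng = random.Random(seed)
    n = round(duration * sample_rate)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    soft = sine_vibration(f0=110.0, force=0.2, duration=0.1)
    hard = sine_vibration(f0=110.0, force=0.9, duration=0.1)
    noise = noise_vibration(duration=0.1, seed=1)
    print(max(soft), max(hard), max(abs(v) for v in noise))
```

The contrast of interest is that only the Sine generator takes pitch and force as inputs, which is what makes its output consistent with the player's gesture.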


The designed vibration types offered different spectral and dynamics cues resulting 
in varying degrees of similarity with the audio feedback, thus making it possible to determine 
the importance of the match between sound and vibration. The intensity of vibration 
feedback was set by the authors in a pilot phase, aiming at two main goals: (i) sound 
and vibration intensities had to feel reciprocally consistent; (ii) while levels had to be 
overall comfortable for prolonged use, vibration had to be clearly perceivable even 
at low force-pressing values [42]. 

At each trial, the task was to play freely while comparing two related setups: they 
were labeled A/B in a balanced way, and differed only in the presence/absence of 
vibration (i.e., they shared the same sound setting). Participants could switch at any 
time between A and B and had to provide ratings for four attributes: Preference, 
Control and responsiveness (referred to as Control), Expressive potential (referred 
to as Expression), and Enjoyment. Ratings were given by adjusting a respective slider 
on a continuous visual analog scale ranging from A (left) to B (right) to reflect the 
degree of preference in terms of the given attribute. In case of perceived equality 
between A and B, the slider would be set to the midpoint. All 4 (attributes) × 3 
(vibration types) × 2 (sound types) factor combinations were evaluated twice. 
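
The trial structure described above can be sketched as follows. The sketch assumes that each trial pairs one sound with one vibration type and collects all four attribute ratings at once, that A/B labels are balanced by giving the vibrating setup each label equally often, and that the slider position is stored in [0, 1] from A to B. The balancing procedure and all names are hypothetical.

```python
import itertools
import random

ATTRIBUTES = ("Preference", "Control", "Expression", "Enjoyment")
VIBRATIONS = ("Sine", "Audio", "Noise")
SOUNDS = ("Sound 1", "Sound 2")
REPETITIONS = 2


def build_trials(seed=None):
    """Return a shuffled trial list with balanced A/B labels.

    Each trial compares a vibrating setup with its non-vibrating twin;
    'vibrating_label' records which of A or B carries the vibration.
    """
    rng = random.Random(seed)
    setups = list(itertools.product(SOUNDS, VIBRATIONS)) * REPETITIONS
    rng.shuffle(setups)
    labels = ["A", "B"] * (len(setups) // 2)  # equal number of A and B
    rng.shuffle(labels)
    return [{"sound": s, "vibration": v, "vibrating_label": lab}
            for (s, v), lab in zip(setups, labels)]


def to_vibration_preference(slider, vibrating_label):
    """Convert a slider position (0 = fully A, 1 = fully B) so that
    1 always means maximal preference for the vibrating setup."""
    return slider if vibrating_label == "B" else 1.0 - slider


if __name__ == "__main__":
    trials = build_trials(seed=0)
    print(len(trials), "trials,", len(trials) * len(ATTRIBUTES), "ratings")
```

With these assumptions each participant completes 12 trials and gives 48 ratings, which matches the 4 × 3 × 2 combinations evaluated twice.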

All 29 participants (7 males and 22 females, aged 18–48 years; M = 25.4, 
SD = 7.1) were professional musicians or music students. Their main instrument 
was either a keyboard or a string instrument, on which they had on average 17 years 
of experience. Roughly one-third of the participants had significant experience with 
electronic musical instruments, mostly synthesizers, or digital musical interfaces. 


12.8.2 Results 


The continuous slider scale ratings were mapped to the closed interval [0, 1], where 1 
indicates a maximal preference for the vibrating setup and 0 maximal preference for 
the non-vibrating setup, and 0.5 is the point of perceived equality. Statistical analysis 
was carried out by fitting a zero-one-inflated beta (ZOIB) model, whose parameters 
were estimated with Bayesian methods [9, 32]. Four parameters describe the ZOIB 
distribution: the mean (μ) and precision (φ) of the beta distribution, the probability 
of a binary {0, 1} outcome (zoi), and the conditional probability of outcome {1} 
(coi). The mean of the beta distribution was modeled by sound, vibration type, their 
interaction, and attribute. The models for the precision (φ) and zero-one-inflation 
parameters (zoi, coi) were set to depend on vibration type, sound, and attribute 
without interactions. 
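
To make the four ZOIB parameters concrete, the sketch below writes the distribution as a mixture: with probability zoi the response is exactly 0 or 1 (1 with conditional probability coi), otherwise it is drawn from a beta distribution with shape parameters μφ and (1 − μ)φ. This is the usual mean/precision parameterization and is assumed here to match the fitted model; the parameter values in the example are made up for illustration and are not estimates from the study.

```python
import math
import random


def zoib_mean(mu, zoi, coi):
    """Expected response: mixture of the point masses and the beta mean."""
    return zoi * coi + (1.0 - zoi) * mu


def zoib_density(y, mu, phi, zoi, coi):
    """Probability mass at y = 0 or y = 1, probability density on (0, 1)."""
    if y == 0.0:
        return zoi * (1.0 - coi)
    if y == 1.0:
        return zoi * coi
    a, b = mu * phi, (1.0 - mu) * phi  # beta shape parameters
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y) - log_beta_fn
    return (1.0 - zoi) * math.exp(log_pdf)


def zoib_sample(mu, phi, zoi, coi, rng):
    """Draw one response from the mixture."""
    if rng.random() < zoi:
        return 1.0 if rng.random() < coi else 0.0
    return rng.betavariate(mu * phi, (1.0 - mu) * phi)


if __name__ == "__main__":
    rng = random.Random(0)
    params = dict(mu=0.56, phi=5.0, zoi=0.1, coi=0.5)  # illustrative values only
    draws = [zoib_sample(rng=rng, **params) for _ in range(20000)]
    print(sum(draws) / len(draws), zoib_mean(0.56, 0.1, 0.5))
```

Because the model has a separate component for each parameter, two conditions can share the same μ and still differ in how spread out or how extreme the ratings are.
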

Estimates for the beta distribution means and their corresponding 95% credible 
intervals are presented in Fig. 12.16. On average, the vibrating setups were preferred 
to their non-vibrating versions: all mean estimates but one are above 0.50 (the point 
of perceived equality) as well as most of the respective credible intervals. 

The model output showed the following effects.⁴ The mean parameter for Audio 
vibration was not credibly different from Sine vibration, while Noise vibration was 
rated credibly lower. Sound type had a credible effect on the mean parameter (μ) 
only in combination with Noise vibration. Expression and Enjoyment both had a 
rather credible positive effect, although slightly short of 95%, on the mean parameter 
relative to Preference and Control. However, many of the manipulated factors had 
credible effects on the precision parameter (φ) and on the zero-one-inflation parameter 
(zoi), suggesting that even if the means are not credibly different, the shapes of the 
respective distributions may differ. 

The main findings of this study may be summarized as follows: i) although not 
large, the measured effect of Sine or Audio vibration was appreciably positive. ii) 
Noise vibration did not credibly enhance the subjective quality of the interface 
as compared to the non-vibrating condition. iii) Vibrotactile feedback especially 
increased the perceived expressiveness of the interface and the enjoyment of play- 
ing. As appears from Fig. 12.16(a), a more marked effect was found when vibration 
was more similar to the sonic feedback and consistent with the user’s gesture: Indeed, 
Sine and Audio vibration follow the pitch of the produced sound and their intensity 
can be controlled by pressure. Conversely, Noise vibration, which offers fixed
amplitude independent of the input gesture and a flat spectrum, was rated lowest among 
the vibrating setups. Noise vibration resulted in slightly better ratings when Sound 2 
was used as compared to Sound 1: again, this was likely because the vibrotactile
feedback is consistent, at least partially, with the noise-like sonic feedback produced 
for high y-axis values. Interestingly, no credible difference in the globally positive 
effect was found between Sine and Audio vibration. This may be at least partially 
explained by a masking effect taking place in the tactile domain toward higher
frequencies, thus impairing waveform discrimination [5]. However, such a phenomenon 
seems not to apply to markedly different signals [49]. In this regard, our informal 
testing revealed that Sine and Audio vibration were virtually indistinguishable,
especially when Sound 1 (modulated sawtooth waveform) was selected. 


4 Note that, unlike the other studies reported in this chapter, these data were analyzed using Bayesian 
inference; therefore, we use the term “credible” instead of “significant” when describing effects. 

Response consistency across repetitions was evaluated by modeling participants’ 
first- and second-round responses by linear regression. Pooled over participants and 
factor combinations, the regression coefficient (β = 0.32, p < 0.001) indicated a 
general overall consistency (i.e., participants preferred the same vibrating or
non-vibrating setup across the two repetitions). However, ten participants frequently
preferred the vibrating setup once and the non-vibrating setup once in the same factor
combination, resulting in regression coefficients < 0 (the mean coefficient over the
N = 10 subjects was β = −0.19). The remaining subjects (N = 19) instead gave consistent 
ratings (β = 0.53). Interestingly, the inconsistent group (N = 10) spent noticeably 
less time on the tasks than the consistent group (N = 19): the median length of their 
gestural data logs was only 62% of that of the consistent group. In order to estimate 
the effect of the inconsistent participants, we re-ran the ZOIB model including only 
the N = 19 consistent subjects and found that the main result was similar to that of the 
full dataset: only vibration type had a clearly credible effect on the estimated mean 
parameter. However, the effect is somewhat larger in this subset, as the mean estimates 
for vibration types Sine and Audio (with Sound 1) slightly increase, while that for 
Noise decreases (see Table 12.3). Also in this case, Expression is the highest rated 
attribute; its marginal mean estimate increases from 0.59 to 0.64 (see Table 12.4). 
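
The consistency measure described above can be illustrated with an ordinary least-squares slope. The sketch assumes that second-round ratings are regressed on first-round ratings (the text does not say which round is the predictor), and the two sets of ratings are invented for the example; the function name is hypothetical.

```python
def regression_slope(first, second):
    """Ordinary least-squares slope of second-round on first-round ratings.

    A slope near 1 means the two rounds agree; a slope below 0 means the
    participant tended to reverse their preference between rounds.
    """
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    sxx = sum((x - mean_x) ** 2 for x in first)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    return sxy / sxx


if __name__ == "__main__":
    # Invented ratings on the [0, 1] scale (1 = vibrating setup preferred).
    consistent_first = [0.2, 0.4, 0.6, 0.8]
    consistent_second = [0.3, 0.4, 0.5, 0.6]
    flipped_first = [0.2, 0.8, 0.3, 0.7]
    flipped_second = [0.8, 0.2, 0.7, 0.3]
    print(regression_slope(consistent_first, consistent_second))
    print(regression_slope(flipped_first, flipped_second))
```

The first pair of rounds gives a positive slope, while the second, where each preference is reversed, gives a negative one, mirroring the split between the consistent and inconsistent groups.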

Table 12.3 Estimated μ parameters from the ZOIB fit (on original response scale) for the marginal 
effects of sound and vibration (attribute = Preference). N = 29: all subjects; N = 19: consistent 
subjects 

Sound   Vibration   Estimate (N = 29)   Estimate (N = 19)
1       Sine        0.563               0.604
1       Audio       0.548               0.576
1       Noise       0.480               0.466
2       Sine        0.536               0.558
2       Audio       0.552               0.550
2       Noise       0.512               0.493


Table 12.4 Estimated μ parameters from the ZOIB fit (on original response scale) for the marginal 
effects of Attribute (sound = Sound 1, vibration = Sine). N = 29: all subjects; N = 19: consistent 
subjects 

Attribute    Estimate (N = 29)   Estimate (N = 19)
Preference   0.563               0.604
Control      0.562               0.594
Expression   0.594               0.645
Enjoyment    0.591               0.628


As the participants were highly skilled musicians, we believe that the recorded 
inconsistent responses were not due to the task being too difficult. However, as 
they were not screened for individual vibrotactile sensitivity, it is possible that they 
did not feel vibrations equally. On top of that, we argue that rating inconsistency 
may be linked to the varying perceived vibration strength and audio-tactile
congruence, depending on where and how the participants were playing over the interface’s 
surface. Indeed, vibrotactile intensity perception is affected by vibration amplitude 
(obviously), spectral content (with a peak in the 200–300 Hz range [55]), and the 
exerted pressing force [42]; also, varying degrees of spectral and temporal similarity 
between auditory and vibratory feedback may result either in cross-modal perceptual 
integration or interference [60]. However, we specifically chose a free playing task in 
order to measure the effect of vibrotactile feedback on various aspects of the playing 
experience. 

With regard to the coherence of specific audio-tactile combinations, although 
Noise vibration resulted in very uniform ratings when associated with Sound 1 
(β = 0.56, p < 0.001), it produced the lowest rating consistency with Sound 2 
(β = 0.16, p < 0.05). While this was obviously affected by the general tendency 
of ten participants toward inconsistent ratings, one may also consider the varying 
degree of similarity between Sound 2 and Noise vibration: at the upper range of the 
y coordinate Sound 2 was noise-like, while for lower y values it was increasingly 
sinusoidal; inconsistency might follow from having played once mostly at high y and 
once mostly at low y. Conversely, Sound 1 retained the same degree of (dis)similarity 
with Noise vibration, independent of the playing position/style. Overall, the noticed 
inconsistency of responses sets a future challenge for screening the participants and 
controlling the playing task. 


12.9 Conclusions 


Based on the reported results, we suggest that the design of future multisensory 
interface technologies, especially if applicable to music performance, should take 
into consideration the addition of advanced vibrotactile feedback. This would enable 
the re-establishment of a consistent physical exchange between users and their
digital devices (similar to the natural relationship that musicians establish with their 
instrument, where the source of sound and vibration coincides), with the
demonstrated potential to enhance the experience and the perceived quality of the interface. 
Indeed, several participants in the reported musical studies were impressed with the 
novelty and “aliveness” of haptic interfaces, as opposed to their experience with 
existing digital musical devices. 

Ultimately, it is yet to be seen if and how such subjective enhancements may 
be reflected in the quality of playing, and musical performance altogether. Making 
objective measurements of these aesthetic aspects however poses a major research 
challenge, and the present work only scratched the surface in this direction. Instead, 
this will be the main object of a follow-up experiment currently in the works. 


Victor Zappi, Dario Mazzanti, and Florent Berthaut 


Abstract Immersive virtual musical instruments (IVMIs) lie at the intersection 
between music technology and virtual reality. Being both digital musical instru- 
ments (DMIs) and elements of virtual environments (VEs), IVMIs have the potential 
to transport the musician into a world of imagination and unprecedented musical 
expression. But when the final aim is to perform live on stage, the employment of 
these technologies is anything but straightforward, for sharing the virtual musical 
experience with the audience gets quite arduous. In this chapter, we assess in detail 
the several technical and conceptual challenges linked to the composition of IVMI 
performances on stage, i.e., their scenography, providing a new critical perspective 
on IVMI performance and design. We first propose a set of dimensions meant to 
analyse IVMI scenographies, as well as to evaluate their compatibility with different 
instrument metaphors and performance rationales. Such dimensions are built from 
the specifics and constraints of DMIs and VEs; they include the level of immersion 
of musicians and spectators and provide an insight into the interaction techniques 
afforded by 3D user interfaces in the context of musical expression. We then analyse 
a number of existing IVMIs and stage setups, and finally suggest new ones, with the 
aim to facilitate the design of future immersive performances. 


13.1 Introduction 


Making music while immersed in a virtual environment (VE) is an exciting experi- 
ence. In a synthetic space designed to replicate and transcend our world, we gain the 
ability to become composers and performers of inventive musical pieces, that lever- 
age unprecedented acoustical phenomena (virtual sound), mechanical phenomena 
(virtual interaction) and perceptual/cognitive phenomena (virtual experience). This 
is possible thanks to the design of virtual devices that, like digital musical
instruments (DMIs), transform interaction into sound, but also allow musical expression
to be channelled through the peculiar features of the surrounding VE. For example, in a 
world where sound can be visible as well as tangible (like the one created by Lanier 
to host his “Virtual Instrumentation” [49]), virtual musical devices might permit not 
only to play notes but also to manipulate them, as they are still echoing in the air. 
In other VEs, their design might leverage the possibility to fly or teleport through 
space to create huge distributed instruments that would be otherwise impossible to 
manoeuvre. What these apparently disparate musical applications share is immersion,
meaning that physical equipment, virtual content and the overarching logic 
that binds the two are geared towards the amplification of the physical and cognitive
involvement of the musician acting in the environment. Since they are both DMIs and 
elements of immersive environments, or may even extend across whole VEs, we call 
these devices immersive virtual musical instruments (IVMIs); they typically consist 
of virtual representations of sound processes and parameters [11, 35], and rely on 
immersive multimodal technologies to support fine 3D interaction. 

As happens with most musical instruments, composition or studio practice
with IVMIs may be considered an end in itself [88]. The musician who is immersed 
in the VE may experience the feeling of satisfaction typical of the completion of 
challenging musical tasks [58]. Moreover, in this scenario satisfaction is likely to be 
combined with the sense of discovery that characterises virtual reality (VR), as the 
IVMI may feature novel musical affordances [10] or, more generally, unusual sets 
of sensory-motor contingencies [79]. While some musical VEs are designed specifically
to elicit such autotelic responses¹ [35, 72, 96], the final aim of a considerable 
number of musicians is to perform with their IVMI in front of various audiences and 
use it to create some sort of connection with them. 

Unfortunately, the step from the studio to the stage is anything but straightforward. 
First of all, the rehearsal spaces of most IVMI players are not standard music studios. 
Before the release of consumer head-mounted displays (HMDs) and the rise of VR 
videogames, IVMIs were almost exclusively designed (and played) in VR research 
facilities, equipped with minimal audio gear such as an audio interface and a mixer [10, 
56, 94]. Nowadays immersive technology is more affordable and VR musicians may 
have access to spaces more akin to traditional music facilities [36, 53]; however, 
in these studios professional audio equipment still needs to be laid side by side 
with tracking systems, HMDs and projectors, all connected to dedicated computers. The 
showcasing of live IVMI performances, even the simplest ones, inevitably relies 
on the employment of such heterogeneous studio equipment, which has to be moved 
to the venue and arranged on stage. 


1 We may label this kind of musical VE with the umbrella term installations. 

But it is not the size or the complexity of the setup alone that makes IVMI 
performances challenging artistic endeavours. Rather, the real hurdle for VR artists 
comes from the nature of the required equipment. Musicians active in contemporary 
popular and underground music scenes are no strangers to the on-stage employment of 
remarkably complex setups, the most straightforward example being a generic rock 
band playing electric, electronic as well as acoustic instruments altogether during 
the same show. However, when a rock band steps on stage and the concert begins, 
the chosen technological setup immediately proves itself fundamental to support 
musical expression and to create a synergetic connection with the audience. The 
instruments fuse with the bodies of the members of the band and each gesture acquires 
a clear musical meaning; cables and speakers disappear from sight, concealed by 
the music and by the flashing lights, that equally immerse the musicians and the 
audience into a large shared audio/visual environment. Unfortunately, this inclusive 
scenario is in stark contrast with the musical experience that is delivered on average 
through an IVMI performance. When inside the VE, the musician fully leverages the 
immersive technology laid on stage to approach the IVMI’s logic and control it, as if 
the instrument were physically there in front of them. But to the eyes of the audience, 
this interaction is quite cryptic, even abstract. This is due to the fact that spectators 
are confined outside of the VE. They see the musician assuming awkward poses 
and contorting themselves while handling invisible objects on a semi-empty stage. 
The only visible clue about the existence of the VE is the technology that surrounds 
the performer, which mediates the interaction between the physical and the virtual 
world, but tells very little about the mechanics of the latter. Without a clear view 
of the virtual objects and their response to interaction, what remains is just a music 
piece almost completely disconnected from the gestures and the physical presence 
of the musician. 

Some may argue that the potential of immersive music extends way beyond the 
virtualisation of the sole performer’s experience. For a moment we can forget about 
stages equipped with immersive gear and even venues, and rather imagine shows 
taking place in completely synthetic worlds, that spectators access remotely from 
their living rooms. This is one of the many social expressions of contemporary VR 
culture [62]. Showcasing a fully virtual performance surely helps the audience see the 
show in its entirety, and better appreciate interaction and aesthetic nuances. However, 
research has shown that VR setups and networked technologies available today are not 
yet capable of providing the same sense of connection triggered by social activities 
set in physical reality, let alone by complex psychophysiological experiences like 
concerts and live performances [50]. 

Leveraging immersion to create a sense of connection/inclusion clearly becomes 
the main challenge for VR artists and IVMI designers, and the very immersive
equipment they rely upon seems to get in the way. During the first experiments with IVMIs 
carried out in the early 1990s, this scenario was not necessarily considered a
limitation to artistic expression.² Conversely, it was taken as an opportunity to explore a 
new relationship between performer and audience [49]. Immersive technology was 
employed to create novel musical instrumentation, but at the same time concealed 
this instrumentation and made the musician unable to know what it looked like to 
the audience. Differently from a traditional musical performance, this scenario did 
not elevate the status of the performer; it put performer and audience on the same 
level instead, providing IVMIs with a distinct aesthetic. However, while a similar 
peer relationship could be achieved by means of other forms of live art too [49], some 
emerging properties of VR appealed to new media artists for their utter uniqueness [57, 
98]. This caused a rapid change of paradigm and by the mid 1990s the focus of VR 
performances became revealing a new world, as opposed to concealing it. Immersive 
technology started to be praised for its potential to offer the audience a stake in 
the VE, in the form of a vicarious experience [31]. In other words, for the first time 
VEs were conceived as spaces in which to live an experience not only via first-person 
interaction (as in the case of the performer), but also by observing someone else 
interacting. This emphasised the importance of being able to perceive continuity 
between the performer’s gestures and the resulting sounds and visuals, to connect 
with the performer, and share with them the same virtual musical experience. 

Today, the design of most IVMIs and VR performances is based on the same 
rationale [10, 36, 41], as artists aim for shared experiences and connection with the 
audience. But in practice this is a costly goal. The setting up of such musical VEs 
requires beforehand a strong commitment to understanding both what performing 
music means and how VR affects action and cognition, as well as a fair amount of 
equipment ready at hand. Alas, in real-case scenarios mental and physical resources 
tend to be limited; trade-offs happen to be extremely common, either in terms of 
a reduction in the sense of agency and immersion of the performer (to favour the 
audience’s side), or as an overall depreciation of what attending live music truly feels 
like (biasing the performer’s role). 


13.1.1 The Role of Scenography 


The impending gap between performer’s and audience’s virtual musical experiences 
is a complex phenomenon that has to be accounted for in every IVMI performance. 
But how can we measure the extent of this gap? And how can we intervene to reduce 
it? What we suggest is to embrace a larger perspective on performance practice, by 
means of applying scenographic theory to the domain of IVMIs. 

In theatre, cinema and television, scenography relates to the study and the
development of audio/visual, spatial and experiential composition of performance, by taking 
into account the perspective of two main stakeholders: performer and audience.³ In 
the case of IVMIs, the dichotomy between performer’s agency and audience
connection is due to the constraints of immersive technology; nonetheless, analogous 
issues often arise when staging a play, filming a movie or broadcasting a live show, 
even if completely different sets of technologies are in use. Shots of magicians or 
jugglers are quite challenging for example, as the presence of one or more cameras 
makes it more difficult for these performers to disguise or show off their dexterous 
movements. A famous example consists of the crystal-ball scenes in Jim Henson’s 
1986 movie Labyrinth, where artist Michael Moschen had to juggle blindly with only 
his right hand in the frame while hiding the rest of his body behind David Bowie. 
In theatrical plays, it is instead the physical presence of an audience that challenges
the performers. Actors are often compelled to face the spectators while interacting 
with stage props or with each other, sacrificing visual feedback/eye contact to create 
a sense of inclusion. One of the main aims of scenography is to account for these 
and similar scenarios. A well-designed scenography fully immerses the spectators in 
the production, eliciting emotional and rational engagement [59] while seamlessly 
synthesising the performers’ and the audience’s experiences [44]. 


2 Here the use of the term “experiments” is by no means intended to devalue the artistic
significance of these early performances; it refers to the inability to predict the effects that such a novel 
technology would have on the overall vibe of the shows and on the audiences’ experience. 

In the context of IVMIs, the scenography of a performance may be defined as the 
complete setup chosen to reproduce the instrument/VE on stage, make it playable 
to the musician and present it to the audience. This includes (1) the technology ded- 
icated to the immersion of the performer, like displays (e.g., HMDs, projections), 
tracking systems (e.g., head tracking, full-body motion capture and active/passive 
markers), physical user interfaces (e.g., joysticks and haptics) and sound monitors 
(e.g., headphones and speakers); (2) the technology addressing the audience’s expe- 
rience, like large screens, projected surfaces, lights and the power amplifier system; 
and (3) the spatial arrangement of such technologies on stage, taking into account the 
freedom of action required by the musician to play the instrument, as well as size and 
position of the stage compared to the seat area or the parterre. In line with general 
scenographic theory, such a practice extends across design, curation and technical 
development [44]. 

When included in the design process of an IVMI performance, the development 
of a specific scenography may change how the VE is experienced, starting from 
disentangling immersive technology from the concept of user. Such a term is at the 
basis of most—if not all—conventional design approaches to VR, which tend to rep- 
resent “the human subject as an omnipotent and isolated viewpoint” [30]. The great 
majority of IVMIs comply with this rationale. This is the main reason why, on stage, 
VR technology is almost exclusively employed to immerse the musician, creating 
the sense of isolation that we discussed earlier in this section. By contrast, musical 
VEs conceived for live performances would benefit greatly from the exploration of 
novel design approaches, possibly discarding strict user-centric solutions. 


3 Scenographic theory often includes the figure of the director among the stakeholders of production; 
to make a simpler connection with IVMI performances we omit this detail, as in most cases the 
artist playing the IVMI is also the composer of the piece as well as the designer of the instrument. 

In this context, scenographic theory may provide valuable guidelines to support 
experimentation. By dedicating time to the study and the development of a proper 
stage setup, artists may devise new ways of employing immersive technology, in 
compliance with the performer–audience paradigm that lies at the core of
scenographic theory [59]. In other words, by definition, a scenography has the power to 
turn the VR user into a performer, fostering an experience that is suitable for the 
entertainment of an external audience. 


13.1.2 A New Approach to IVMI Performance 


The design of a proper IVMI scenography is not an easy task though. In particular, 
the transition from user to performer proves to be a critical hurdle, as witnessed by
the remarkable number of musical VEs that have been designed as instruments but 
never reached the stage. Indeed, despite the relatively large literature, very few works 
report the showcase of musical VEs in the context of concerts, for most IVMIs tend 
to be used as installations or research platforms, rather than as instruments for live 
performance [35, 47, 56, 94]. 

The aim of this chapter is to address this problem and combine theory and prac- 
tice to facilitate the design of IVMI scenographies. To do so, we propose a set of 
dimensions specifically conceived for the analysis of immersive stage setups. Such 
dimensions form a set of evaluation criteria that reflect the twofold nature of IVMIs. 
They stem from the detailed examination of the specifics of immersive VEs, combined
with the practicalities of live music performances, in particular those featuring
DMIs. As a consequence, their application makes it possible to extrapolate from a chosen
stage setup the technical characteristics that affect these factors and to qualitatively
evaluate their individual impact on the showcasing of a generic IVMI performance.
Furthermore, when the stage setup is coupled with a specific immersive instrument, 
the outcome of the analysis provides quick metrics to assess the experiential gap that 
likely divides performer and audience, also highlighting the main causes of such a 
disconnection. 

It is worth noting that the scope of this work extends beyond the domain of VR. 
As detailed in the following sections, virtual performances and scenographies often 
span augmented and mixed realisations too, including see-through visual displays 
and a combination of physical and virtual stage props. This is the reason why we 
are referring to immersive virtual musical instruments as opposed to virtual reality 
musical instruments (commonly referred to as VRMI [78]), the latter being for the 
most part a sub-category of the former. 

The actual relationship between these two classes of musical devices appears clear 
in the context of the categorisation of interactive environments proposed by Milgram 
and Kishino [65]. The two authors introduce a single continuous axis, the “virtuality
continuum”, that goes from real environments (where everything is physical)
to VEs (that host synthetic elements only), and encompasses in between all kinds of 
environments that mix physical and synthetic entities. On such a continuum, VRMIs 
belong to the far end of the spectrum (“virtuality”), and are distinct from devices that 
rely on technologies that lie closer to the “reality” end, like for example augmented
reality (AR). However, the authors point out that virtual, augmented and any other 
kind of mixed technology can be characterised by different levels of immersion, 
regardless of their location on the continuum (see Sect. 13.3.1 for a more thorough 
discussion about immersion and its degrees of execution). In line with this perspec- 
tive, IVMIs do not belong to a single point in the continuum, rather they cut across 
the spectrum; they include VRMIs, AR musical devices and any mixed solution in 
between, provided that the design of the instrument targets immersion. 

Before moving forward, we’d like to remind the reader that this chapter is an 
extended and revised version of a pre-existing work, published by the authors in 
2014 [12]. The decision to try to improve our contribution to this challenging research 
and artistic field comes from a very practical consideration. Over the seven years 
following the original publication, VR technologies have settled in the world of 
videogames and consumer electronics, and today the result of this process is the 
emergence of a new generation of immersive instruments and performances. For 
the first time, musicians have access to commercial IVMIs alongside more afford- 
able and reliable resources for do-it-yourself development, to make new music and 
engage with new audio/visual experiences. And as expected, this is happening both 
inside and outside music studios. Companies are teaming up with underground as 
well as mainstream artists to popularise the use of new immersive devices in perfor- 
mance settings, starting the exploration of innovative stage technologies to sell on 
the market of music and entertainment. In this vibrant scenario, the need for guid- 
ance in performance and instrument design is stronger than ever. The way we try 
to fulfil this need is by presenting a new set of cross-domain dimensions; by doing 
so, we aim at combining in one single critical perspective the practical—as well as 
cultural—implications that derive from the latest development of immersive musical 
technologies. 

In line with this purpose, the rest of the chapter is structured as follows. Sec- 
tions 13.2 and 13.3 discuss the main technological as well as experiential factors that 
play a role in the context of DMI performance and of immersive VEs, respectively. 
We will refer to these factors as constraints, a term originating from human-computer
interaction [68] yet widely used in both DMI and VR literature [20, 32, 38, 93]. In 
particular, we embrace Magnusson’s take on the subject, which deems constraints 
complementary to the affordances of an artefact/system [55]; in the context of this 
work, this means that by following cultural conventions and by adhering to techni- 
cal and psychophysical requirements, it is possible to express at best the potential of 
DMIs, VEs as well as IVMIs. Starting from such constraints, in Sect. 13.4 we provide 
a detailed presentation of the set of dimensions we conceived to support the practice 
of scenographers and IVMI designers. Then, the following two sections exemplify 
how the dimensions may be applied to real-case scenarios. Section 13.5 analyses an 
assorted selection of IVMI performances spanning the last 30 years, with the aim of
assessing the type of experience provided to musicians and audience across all dimensions,
while Sect. 13.6 shifts the focus to the future of immersive scenography, as
we introduce novel stage setups and we use the dimensions to frame their potential 
when combined with IVMIs. Some of the solutions discussed in these two sections 
provide concrete examples of how to bring a musical VE to a live stage not only by
using immersive VR technologies, but also by combining AR equipment/paradigms 
within the setup. Finally, conclusions are drawn in Sect. 13.7. 


13.2 Constraints of the Digital Live Music Experience 


DMIs are flexible tools that allow for the exploration of original musical and design 
practices. The vast potential granted by digital technologies makes it possible for 
designers and players to embrace the most daring sensing and interaction techniques, 
and to combine them with sound synthesis technologies that can also extend into the 
analog domain [61, 76]. Moreover, any mapping between musician’s gestures and 
sound parameters can be devised almost arbitrarily, removing further limitations 
from the creative process [45, 91]. 

Unfortunately, this design freedom leads to great challenges when transferring a 
DMI from the studio to a live stage.⁴ The type of musical exploration afforded by
DMIs often manifests itself through bizarre and do-it-yourself equipment, unusual 
gestures, abstract sounds and idiosyncratic mapping between them. If not properly 
contextualised (in both a broad and a literal sense), these very distinctive features 
may impinge on the audience’s experience of the show, as well as on the technical 
and expressive proficiency of the performers. 

This section looks at DMIs from the perspective of live music performance. In 
particular, we discuss what we consider the main constraints linked to this form 
of expression/entertainment. Although centred on novel digital technologies, the list 
includes constraints that may as well inform the design of performances for traditional 
instruments. However, their overall impact is far more relevant when relocated within 
the domain of DMIs. 


13.2.1 Stage Performance 


On stage, performers need to be comfortable with their instruments. In an ideal 
scenario, a DMI plays the same regardless of where it is played, allowing the musician 
to build a live performance around the same affordances explored in studio and 
rehearsal spaces. Unfortunately, this is not always the case. Some DMIs are big 
and complicated, composed of parts that are difficult to assemble/disassemble or 
simply fragile. When dealing with such designs, the way the instrument is set up 
on stage often differs from the original studio configuration, forcing the performer 
to adapt their playing postures or even to sacrifice important visual/sound/haptic 
cues. Other musical systems impose requirements on the specifications of the stage 
itself. These include peculiar lighting, accurate microphone placement and support 


4 A great portion of the DMI literature discusses how detrimental the apparent lack of design 
limitations may be on the musical appropriation of the instrument too, but this is beyond the scope 
of our work. 
Fig. 13.1 López’s playing at the NIME Conference 2011; despite the use of curtains and dimmed
lights, the instrument’s optical tracking system kept malfunctioning, forcing López to adapt the 
execution of his pieces to the adverse situation. Image courtesy of Alexander Refsum Jensenius 


for multichannel audio playback; sometimes calibration procedures are required too, 
as in the case of multichannel audio or motion capture. If any of these elements is 
missing or merely not compatible with the DMI, some of the musical affordances 
and functionalities the performer grew accustomed to may become unavailable, right 
before the start of the show. A notable example of this contingency is the opening 
concert of the NIME Conference 2011. On that occasion, Carles López had to rework 
his performance on the fly since the adverse lighting conditions of the stage made a 
large portion of his Reactable unresponsive (Fig. 13.1).

Approaching the stage unprepared clearly is a hazard and not all performers have 
the flexibility displayed by López (his performance was a success!). To avoid this
risk, it is not uncommon for DMI musicians to organise live events in their practice 
studios, leveraging the very spaces where the instruments were designed, built and 
tested [97]. Yet, the appeal of a real venue is invaluable. Creators of a musical 
performance involving DMIs should dedicate particular attention to the phase of the 
stage setup. Issues and necessities should be anticipated with care, from the most 
general and basic ones to the most specific and complex ones: will cables be long 
enough to connect the required hardware across the stage? How long does it take to 
calibrate and prepare the instrument? Are any of the pieces of equipment employed 
in the design difficult to install/use on a regular stage? Similar questions should 
arise early in the DMI’s creation process, and could very well affect its design and 
behaviour. 


13.2.2 Communication Between Performers and Audience


Communication between performers and audience is another fundamental aspect of 
musical performances. Often coupled with cognition, communication is one of the 
main terms in use in music psychology and emotion research to frame the experi- 
ence of playing and attending a live show. In this context, communication refers to 
the musician sending encoded messages for the spectators to interpret; more
specifically, these resolve into music as well as related actions conceived to trigger 
specific emotions in the listener/viewer [48, 71]. However, authors like Gurevich, 
Treviño and Fyans thoroughly discussed how the application of such a model in
the domain of DMIs is quite controversial, as it does not account for experimental 
and improvisatory music (to name a few), nor for the non-instrumental/intellectual 
engagement such instruments seem to be better suited for [37, 39]. 

To avoid diverging from the main topic of this chapter, in this work we adopt
a definition of communication that is more akin to Bongers’ discourse on human-machine
interaction [17]: non-verbal and not necessarily musical cues that define
the interplay existing between audience and performers. From this perspective, com- 
munication can be analysed as a constraint, rather than as a yet-to-understand factor 
of music cognition. 

Nonetheless, the way such a constraint is dealt with when designing a perfor- 
mance can still deeply influence and characterise the live music experience. Musi- 
cians should be able to perceive reactions of the audience, in order to adjust their 
playing and get a feeling of the ambience. For example, an improvised section could 
last longer or be cut short based on the hints performers can get from the audience. Or 
in large venues, performers could feel like getting physically closer to the spectators, 
or move around the stage also based on non-verbal cues. Spectators can actively
communicate their emotions and appreciation to the performers via social and cultural
conventions too, for example through gestures, like applauding and shouting. Symmetrically,
spectators should perceive musicians’ expressions, gestures and looks,
which are part of their playing style and together with the sonic outcome contribute 
to outline the performance. To this end, stages for live performances played in front 
of huge crowds typically include big screens showing close-ups of the musicians. 

Apart from the direct interplay between the parties, Bongers also describes another
type of communication, happening in the context of “performer-system-audience”
interaction. Performer and audience can indeed use the very technology setup on 
stage/in the venue (the “system”) as a communication channel beyond sound and 
music. In his work, he discusses performances in which multimodality and—in 
particular—VR technologies are leveraged by the musicians to provide visual and 
haptic stimuli to single spectators, as well as multiple members of the audience. 
Finally, this type of communication includes the case of participatory performances, 
with spectators being able to use the system to input content in the performance and 
share information with the musicians. Some examples are the use of text messages 
as both sonification and literal communication means [29], or votes on the preferred 
type of music [92]. 


13.2.3 Music Ensemble


In performances involving multiple musicians, group dynamics are an essential 
aspect of both the performers’ and spectators’ experience. In fact, the interaction 
between musicians within the ensemble when playing DMIs may differ from what
happens with traditional instruments. Moreover, the difficulty in understanding the 
musician’s gestures might increase with an ensemble, as it may also be difficult for 
the spectators to understand who is doing what [64]. Collaboration modes in dig- 
ital ensembles can be separated into cooperation, communication and organisation 
modes [9]. Cooperation modes, when concurrent or complementary, allow musicians 
to share parts of sound generation processes or even allow other musicians to play 
their instrument. These choices and changes can be highlighted for the benefit of the 
audience—or even of other musicians, if they are not involved in the sharing process. 
Communication modes, such as exchanging messages or gestural indications, can
also be amplified for the audience, as done in [51] since they can be less visible 
than with acoustic ensembles. Finally, organisation modes, which allow musicians 
to define roles such as conductor and groups within the orchestra, are usually obvious 
from spatial arrangement of musicians in acoustic ensembles and might need to be 
reinforced for the audience of digital orchestras. 


13.3 Virtual Environments and the Constraints 
of Immersive Experiences 


Compared to the case of live digital music, the compilation of a list of constraints 
capable of informing how we experience virtuality may seem overwhelming. The 
design of most, if not all, live DMI performances targets the delivery of one or 
more musical pieces; and while the details of the chosen technological setups vary 
from performance to performance and from artist to artist, their employment on 
and off stage is always dedicated to supporting the re-creation and the diffusion of 
the featured music (as discussed in the previous section). Conversely, the variety of 
applications and scopes of systems capitalising on VR is astonishing, spanning indus- 
trial design [46], psychological and physical therapeutics [43], military training [52] 
and—of course—musical applications, just to name a few. From this perspective, it 
is quite hard to pinpoint all the requirements of such systems and scenarios and to
address in a single discussion the contingencies relative to employed technologies 
and common practices. 

Luckily, the literature in VR research highlights an overarching theme that is 
common to all VR applications, and that can be used as the lens through which to 
analyse the constraints affecting the users’ experience of generic VEs. This theme 
is the search for presence. In particular, presence has been described as the psycho- 
logical sense of “being in the VE” [84], a specific state of consciousness that ought 
to be experienced by VR users. In an optimal scenario, when a user feels “present” 
in a virtual world, they act as if the environment were real, physically and emotionally
engaged in the application. Therefore, presence is targeted by all designers of
VEs, regardless of the specific scope of the application or the technical details of 
the system. Furthermore, the concept of presence is tightly connected to disciplines 
like physiology, perception and psychology [60, 80, 82], making carefully designed 
narratives, settings and tasks necessary for it to be triggered. 

In line with this scenario, we can consider as constraints of VEs all the technological
and experiential factors that play a role towards the establishment and
the preservation of the feeling of presence in VR applications. In this section, we
gather and discuss such constraints, placing emphasis on the aspects that will have a 
particular significance when crossing the domain of musical performance. 


13.3.1 Immersion 


Immersion is a key constraint in VR. The term refers to the description of the tech- 
nology used to make the user feel present in the VE [81]. Immersive VR applications 
are characterised by a combination of equipment and techniques, the most common 
being wide stereoscopic viewports, multimodal feedback, detailed graphics, high 
framerate and large tracking areas. Such an arsenal may sound quite heterogeneous, 
and in fact it is not trivial to design and combine all its components as functional 
elements of a robust global system. Yet, the effect that this class of immersive technologies
has on presence is so immediate and conspicuous that researchers used
to identify their technical specifications as the main constraints of VR [20].
Nowadays, however, other immersive features are deemed fundamental too, for example those
pertaining to the design of the content of the VE, and in particular those details that
grant a coherent perception of the virtual objects, the surrounding virtual world and
the virtual representations of the body of the user. In technical terms, this translates
into scale, perspective and alignment. On a cognitive level, this coherence relies on 
components such as place illusion and plausibility, i.e., the sensation that the place 
and events occurring are real [79]. In a musical performance context with 3D avatars, 
plausibility, for example, seems to be strongly linked to eye contact with the musicians
[8]. Presence is also strengthened by virtual body ownership, i.e., when one
perceives their virtual avatar body as their own. 

As discussed in [25], the effects of all these technologies and techniques are highly 
interconnected with one another. Moreover, the absence or the misuse of any of them 
may produce immediate breaks in presence [19]. For example, in a poorly designed 
VR setup the user may end up pulling the cable of a tracking device, or may thrust 
their hand through a virtual object, revealing its inconsistency. Similar contingencies 
have both perceptual and physiological consequences on the users, which can be 
measured to determine the extent of the experienced loss in presence [86]. Hence, 
immersion often fulfils the role of a filter too. Equipment and design techniques can 
be employed to block unwanted stimuli that come from the real world surrounding 
the VR setup, and that are often collectively referred to simply as noise. These include 
the touch of hard boundaries of the tracking space, like sensor stands and walls [16,
67], or even the voice of people conversing close by.⁵ Properly filtering noise from
a VR setup is not enough to avoid all breaks in presence, but it is a good practice to 
minimise those caused by external reasons [80]. 

The strong connection between immersion and equipment means that different VR 
setups are characterised by more or less pronounced immersive features, regardless 
of the actual applications run with them. Then, the overall feeling of presence experi- 
enced by the user will depend on the specific combination of the immersive setup and 
the immersive design features of the software. But the role of the equipment/setting is 
so prominent that often times VR setups are assigned labels hinting at their intrinsic 
level of immersion [75]. These labels range from fully immersive, like the case of 
consumer VR headsets available nowadays and composed of HMDs, head and hand 
trackers, to non-immersive, denoting monoscopic screens and general-purpose input 
devices, like mice, buttons and joysticks.⁶

The case of partially immersive setups (also labelled semi-immersive) is par- 
ticularly interesting for the purpose of this work. Most of these mid-tier solutions 
capitalise on stereoscopic monitors and stereo-projected screens, occasionally cou- 
pled with head tracking. The result is a window on the virtual world, whose size 
is proportional to the rendering/projection area. Hence, the smaller the window, the 
more likely for the elements of the VE to end up beyond the clear-cut boundaries of 
the visual display and disappear from sight—especially during manipulation or loco- 
motion. This eventuality endangers presence and represents the main technological 
limitation of partially immersive setups. And similar risks affect AR applications 
too, which leverage setups belonging to the same class. 

Yet, monitors and projected screens provide VR designers with the opportunity to 
seamlessly combine real and virtual elements in the virtual experience. For example, 
a large projected setup makes it possible to perceive the real hands and body of the user literally
inside the VE, along with the virtual objects that populate it (or virtual objects 
inside the real world, as in the case of AR). Moreover, real-world objects and props 
may be used to carry out virtual interaction, hence entering the domain of hybrid 
reality [65]. As a consequence, the overall level of immersion that is achievable when 
using partially immersive setups largely depends also on reality-virtuality continuity,
i.e., the set of immersive design features aimed at generating a consistent perceptual
connection between the real and the virtual world. We can consider reality-virtuality
continuity as an extension of the triad scale/perspective/alignment that entangles 
rendering and tracking with the physical properties of real space. 

AR displays can be used to cover a range of setups from partially immersive, e.g., 
integrating a few virtual elements in the physical space, to almost fully immersive, 
e.g., placing users in a virtual room or on a virtual stage. Because the physical space 


5 Interestingly, this happens both in the context of research experiments, during which the subject 
may hear comments from the investigators or other lab members, and in casual game sessions, when 
friends/observers support the immersed player. 

6 Indeed, when you are playing a videogame on your laptop or console, you are actually experiencing 
non-immersive VR! 
remains visible in all these cases, AR inherits from usual performance conditions:
direct visibility of the performers and other spectators, and visibility of one’s own 
body. These setups may also amplify errors in scale/perspective/alignment provided 
by a 3D display. However, these displays may also restrict 3D interaction opportuni- 
ties to a subset of the techniques described below if part of the physical environment 
remains visible (e.g., navigation in a VE might be perceptually confusing if it is not 
correctly designed). 


13.3.2 3D Interaction 


Immersive technologies and design features are not the only means to trigger a sense 
of presence. In most cases, being able to interact with the VE encourages the user to
deem the virtual world they are immersed in as real, and to forget that the experience 
is actually taking place in a different physical space. In other words, interaction 
is another powerful ally in the search for presence. The argument supporting this 
approach is that the reality of experience is defined by functionality rather than 
appearances, hence the sense of “being there” in a VE is grounded on the ability 
to “do” there [34, 77]. This does not mean that in an interactive system immersive 
technologies are superfluous or even a waste of money/resources; rather, interaction 
can be considered a constraint working on a different level from immersion, and both 
can be combined to describe VR more in-depth. 

Interaction with the virtual world (which we also referred to as “virtual interaction”)
consists of altering the state of 3D models that populate foreground and/or
background of the scene. This paradigm offers quite different perspectives—and 
challenges—compared to the case of interaction with 2D widgets, text and icons; for 
this reason, the term 3D interaction is often used to distinguish it from the “desktop 
metaphor” employed in traditional personal computer environments [42]. Existing 
research on 3D interaction discusses an assortment of techniques, usually classified 
using the following categories: selection, manipulation, navigation and application 
control (the latter pertaining to menus and other VE configuration widgets, as a 3D 
extension of 2D interaction). In this section, we focus on the first three categories, as 
they provide a greater variety of controls with higher dimensionalities and are better 
suited to frame musical interaction in VEs. 

Selection techniques allow users to indicate an object or group of objects in the VE. 
They are essential as they precede all manipulation techniques (i.e., to indicate which 
object will be manipulated) and some navigation techniques (e.g., to select a point of 
interest that the user wants to inspect). Several classifications of selection techniques 
have been proposed. Among them, Bowman et al. [18] classify techniques according 
to the object indication method (occlusion, object touching, pointing and indirect 
selection), activation method (event, gesture and voice command) and feedback type 
(text, aural, visual and force/tactile). More recently, Argelaguet and Andujar [1] 
proposed a set of design variables which allows for describing selection techniques 
according to, for example, the selection tool and how it is controlled, the control-display
ratio or the disambiguation mechanism to avoid multiple selections. Most
common techniques involve either a virtual ray/cone projected from the user through 
the environment or a virtual cursor/hand mapped to the user’s hand movements. 
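
As a minimal sketch of the ray-based family, assuming that selectable objects are approximated by bounding spheres, the following Python fragment casts a ray from the user's hand and returns the closest object it hits. The names Sphere, ray_hit_distance and select are hypothetical and are not taken from any of the cited systems; choosing the nearest hit is only one of the possible disambiguation mechanisms.

```python
import math
from dataclasses import dataclass


@dataclass
class Sphere:
    """Hypothetical selectable object, approximated by a bounding sphere."""
    name: str
    center: tuple  # (x, y, z)
    radius: float


def ray_hit_distance(origin, direction, sphere):
    """Distance along the ray to the sphere surface, or None if the ray misses.

    `direction` is assumed to be a unit vector.
    """
    # Vector from the ray origin to the sphere centre.
    oc = [c - o for c, o in zip(sphere.center, origin)]
    # Length of the projection of that vector onto the ray.
    t_closest = sum(a * b for a, b in zip(oc, direction))
    if t_closest < 0:
        return None  # the sphere centre lies behind the user
    # Squared distance between the sphere centre and the ray.
    d2 = sum(a * a for a in oc) - t_closest ** 2
    r2 = sphere.radius ** 2
    if d2 > r2:
        return None
    # Clamp to zero in case the ray starts inside the sphere.
    return max(0.0, t_closest - math.sqrt(r2 - d2))


def select(origin, direction, objects):
    """Return the name of the nearest object hit by the ray, or None."""
    hits = []
    for obj in objects:
        t = ray_hit_distance(origin, direction, obj)
        if t is not None:
            hits.append((t, obj.name))
    return min(hits)[1] if hits else None


scene = [
    Sphere("node_a", (0.0, 0.0, -2.0), 0.5),
    Sphere("node_b", (0.0, 0.0, -5.0), 0.5),
]
print(select((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), scene))  # node_a
```

A cone-based variant would replace the distance test with an angular threshold around the ray, which is more forgiving for small or distant targets.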

Manipulation techniques allow users to modify the spatial transform of elements 
of a VE, namely rotation, scaling and translation. They can also be used to modify 
their material (albedo, texture and other shading properties) through virtual tools such 
as virtual paint brushes and 3D palettes. Other techniques focus on the modification 
of the shape of composite 3D structures or 3D meshes, in particular through virtual 
sculpting metaphors. A recent review of such manipulation techniques can be found 
in [63]. 
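
To make the notion of spatial transform concrete, the sketch below applies the three basic operations to a single vertex. It is an illustration under simplifying assumptions (uniform scale, rotation about the vertical axis only, a fixed scale-rotate-translate order), and transform_point is a hypothetical helper rather than a function of any specific VR toolkit.

```python
import math


def transform_point(p, scale=1.0, yaw_deg=0.0, translation=(0.0, 0.0, 0.0)):
    """Apply uniform scale, then rotation about the y axis, then translation."""
    # Scaling: multiply every coordinate by the same factor.
    x, y, z = (c * scale for c in p)
    # Rotation about the vertical (y) axis, right-handed convention.
    a = math.radians(yaw_deg)
    xr = x * math.cos(a) + z * math.sin(a)
    zr = -x * math.sin(a) + z * math.cos(a)
    # Translation: offset the rotated point.
    tx, ty, tz = translation
    return (xr + tx, y + ty, zr + tz)


# A vertex on the x axis, doubled in size, turned by 90 degrees and lifted by 1.
print(transform_point((1.0, 0.0, 0.0), scale=2.0, yaw_deg=90.0,
                      translation=(0.0, 1.0, 0.0)))
```

In an actual VE the same operations are normally packed into a single 4x4 matrix applied to every vertex of the manipulated object.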

Navigation techniques allow users to move inside the VE. This translates to the 
discovery of new areas and details of the virtual world, often segueing into the selec- 
tion and the manipulation of virtual objects. From a technical perspective, navigation 
consists of a real-time update of the user’s visual feedback carried out by the ren- 
dering engine, which provides a consistent dynamic representation of all the 3D 
models that cross the viewport. One possible classification of navigation techniques 
was introduced by [54] and separates them into three categories. General move- 
ment comprises all exploratory displacements through the VE, for example flying or 
walking. The case of walking is of particular interest; this type of navigation sup- 
ports natural locomotion, a solution that has a strong impact on presence [83] and 
whose effectiveness can be further enhanced by means of walk-in-place immersive 
technologies and design features [67, 73]. The second category is targeted move- 
ment, which includes all techniques for which the user defines a target position and 
orientation within the VE. These can be discrete, when jumping or teleporting, but 
also continuous with smooth transitions between positions, such as those proposed 
in the Navidget technique [40]. Finally, specified trajectory movement techniques 
allow users to define a path through the VE which is then followed with different 
degrees of automation. 
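
The difference between discrete and continuous targeted movement can be sketched as follows. This is only an illustration, assuming positions are plain (x, y, z) tuples and ignoring orientation; the function names are hypothetical and the linear interpolation is not meant to reproduce the Navidget technique.

```python
def teleport(current, target):
    """Discrete targeted movement: the viewpoint jumps straight to the target.

    `current` is unused and kept only to mirror smooth_transition's signature.
    """
    return [target]


def smooth_transition(current, target, steps):
    """Continuous targeted movement: linearly interpolated viewpoint positions.

    Returns `steps` positions, the last one coinciding with the target.
    """
    path = []
    for i in range(1, steps + 1):
        t = i / steps  # interpolation factor, from 1/steps up to 1
        path.append(tuple(c + (g - c) * t for c, g in zip(current, target)))
    return path


start, goal = (0.0, 0.0, 0.0), (4.0, 0.0, -2.0)
print(teleport(start, goal))              # one position: the target itself
print(smooth_transition(start, goal, 4))  # four intermediate viewpoints
```

Each returned position would be handed to the rendering engine on successive frames, so a larger number of steps yields a slower and smoother transition.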

In the context of musical expression and IVMIs, these categories of interaction 
techniques can be, and have been, used for all types of gestures, including the selection
of components of the instrument, excitation/production of sound and modulation 
of sound parameters [21]. For example, in Drile [10] a virtual ray technique is 
utilised for selecting tools and nodes of musical trees, while in Mäki-Patola’s VR
percussion instrument [56] virtual sticks are used to trigger sounds. Techniques from 
the same category can be employed for both discrete and continuous controls in 
IVMIs. For instance, 3D navigation in Versum [3] allows for continuously controlling 
the volume of sound sources placed in the virtual environment. By changing to a 
discrete navigation technique, such as teleportation, one could trigger presets of 
sound mix, eventually playing them in a rhythm. 
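
The idea of using navigation as a musical control can be sketched with a simple distance-to-gain mapping. The inverse-distance law, the reference distance and the names distance_gain and mix are assumptions made for illustration; the actual mapping used in Versum is not described here.

```python
import math


def distance_gain(listener, source, ref_dist=1.0, min_gain=0.0):
    """Inverse-distance gain, clamped to 1.0 within the reference distance."""
    d = math.dist(listener, source)
    if d <= ref_dist:
        return 1.0
    return max(min_gain, ref_dist / d)


def mix(listener, sources):
    """Gain of every sound source for a given listener position."""
    return {name: distance_gain(listener, pos) for name, pos in sources.items()}


# Two hypothetical sound sources placed in the VE.
sources = {"drone": (0.0, 0.0, 0.0), "bells": (4.0, 0.0, 0.0)}

# Continuous navigation: the mix changes gradually as the listener moves.
print(mix((0.0, 0.0, 0.0), sources))  # {'drone': 1.0, 'bells': 0.25}
# Discrete navigation: teleporting next to the bells recalls a different mix.
print(mix((4.0, 0.0, 0.0), sources))  # {'drone': 0.25, 'bells': 1.0}
```

With this mapping, a list of teleportation targets effectively behaves as a bank of mix presets that can be recalled in time with the music.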

From a scenographical point of view, 3D interaction techniques do not all offer 
the same level of transparency [66], meaning that the influence of the musician on the 
VE can be more or less difficult to appreciate [13]. Navigation in VEs may be easily 
perceivable through the movements of avatars or changes in viewpoint. Manipulation 
and selection techniques however, especially when they involve subtle gestures (e.g., 
button presses, joysticks, finger poses) or complex graphical tools (e.g., sculpting or 
selection disambiguation), can prove more difficult to perceive and understand. 

This effect can be reinforced in the case of techniques which require spatial 
alignment between the physical musician, or their avatar, and 3D graphical tools, for 
instance in the case of virtual rays techniques. In fact, a correct perceptual alignment 
then requires a fully immersive setup, either in VR or AR, making partially immersive 
setups less suited for specific interaction techniques. 

The 3D interaction aspect of IVMIs constitutes one of the dimensions that we 
propose and is described in Sect. 13.4.5. 


13.3.3 Collaboration and Observation 


VR is not always a solo experience. Collaborative and social VEs [7, 26] are the subject 
of extensive study that pertains to the relations among two or more immersed users and 
has yielded a large number of questions and results. In this context, users interact or 
cooperate within a shared VE, for example to collectively design industrial products 
or to join a gathering from remote locations. 

When users leverage direct collaboration to achieve a practical common task, a 
number of factors influence efficiency as well as the dynamics of personal interac- 
tion. Analogous to the feeling of presence described above, co-presence [24] can be 
defined as the sense of being together in the VE. It has been shown to depend to a 
large degree on avatar appearance, as more realistic avatars tend to elicit a stronger 
sense of co-presence, as well as on the level of cooperation required to complete 
the actual task [70]. Another important aspect of practical collaboration in VEs is 
awareness [4]. Awareness can be defined as the understanding of other users’ actions 
within the virtual world, a concept that relates strongly to the issues of musical per- 
formances with DMIs covered in the previous section. Once again, embodiment (i.e., 
the provision of users with appropriate body images) has proven to have a strong 
impact on awareness [5]. Yet, other visual cues have also been proposed, such as a 
representation of the view cone of each user, signalling what is in sight and where 
the individual focus is. 

The VR literature also discusses the case of observation without direct interac- 
tion. Virtual public speaking has been studied to understand the user/speaker’s emo- 
tional response when performing in front of virtual audiences (immersed observers), 
leading to applications in psychotherapy for social phobias [85]. Moreover, other 
experiments focused on the observers themselves, and on the levels of presence and 
arousal triggered by watching virtual interaction as carried out by other users, using 
both immersive and non-immersive setups [22, 50]. As expected, these studies sug- 
gest that the lack of active involvement makes observers feel less engaged and less 
“present” in the VE compared to users. However, when both users and observers are 
properly immersed, witnessing real-time 3D interaction showed the potential to trigger 
a powerful perceptual experience, along with emotional responses well beyond 
the standards of non-immersive media and applications. 


13.4 Dimensions of IVMI Scenographies 


When designing and showcasing an immersive performance, artists have to take into 
account the full set of constraints that govern the experience of both digital live 
music and virtual interaction. Choosing the most appropriate stage technology to 
address each constraint may seem the obvious modus operandi, yet in a realistic 
scenario this straightforward approach proves hard to apply. In the first instance, some 
DMI and VR constraints appear to be orthogonal, meaning that good design in one 
domain tends to break constraints in the other. In other words, a technical solution 
specifically designed for musical purposes may end up hampering the device’s VR 
functionalities and, vice versa, efforts targeting the virtual experience often degrade 
the pleasantness of the musical performance. Furthermore, constraints from different 
domains can combine, making standard technologies and common practices suddenly 
less effective in preserving engagement and expression. 

For instance, moving to a VR audience-performer scenario, immersion deeply 
affects both the audience’s and the musician’s experience. An immersive performance 
acts on the audience’s feeling of presence within the VE used on stage. As a conse- 
quence, the virtual instrument and all its 3D graphical components can be perceived, 
to a certain extent, as “real”. In more practical terms, HMDs, single-user projections 
and head-tracking grant the performer the level of immersion required to master 
the instrument, but exclude the spectators from the VE and cut direct communication 
between them and the performer. And the higher the performer’s immersion (i.e., the 
more refined 3D musical interaction), the less intense the audience’s experience (i.e., 
the less understanding and communication). As discussed in Sect. 13.5, the reverse 
is also true. 

In this section we define the seven dimensions of performance setups of IVMIs 
and how they relate to the musical performance and VR constraints defined above. 
These can be visualised as a dimension space, as shown in Fig. 13.2. 

More than instruments based on physical, gestural or 2D graphical interfaces, 
IVMIs may create a strong asymmetry of performance experience between musi- 
cians and spectators, depending on the display and interaction technologies used on 
each side. In turn this also generates different constraints, which we take into account 
by dedicating some dimensions to the audience experience and others to that of the 
performer. For instance, the following dimensions focus on the performer’s experience: 
Performers Transportation, Ensemble Potential, Interaction Spectrum and Spectators 
Visibility. Those targeting the audience experience are Spectators Awareness, 
Spectators Transportation and Performers Visibility. By placing them on the two sides 
of the dimension space shown in Fig. 13.3, one can quickly judge the asymmetry in 
a given performance setup. The dimension space also distinguishes between interactive 
aspects, in the top half of the diagram, and immersion aspects, in the bottom half. 

Fig. 13.2 Dimension space to describe performance setups of IVMIs 

The seven dimensions emerged from multiple iterations and numerous discussions, 
with the aim of being usable for both the design and analysis of scenographies, 
addressing all aspects of the audience’s and musicians’ experience through the 
technical choices of performance setup. 
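
As a rough illustration of how the dimension space can support a quick asymmetry judgement, a scenography can be treated as a record of seven scores split into a performer side and a spectator side. The sketch below rests on assumptions that the chapter does not make: the 0 to 1 scale, the field names, the use of a plain mean per side, and the example values are all hypothetical.

```python
from dataclasses import dataclass

# Grouping of the dimensions by the experience they focus on (from the text).
PERFORMER_SIDE = (
    "performers_transportation",
    "ensemble_potential",
    "interaction_spectrum",
    "spectators_visibility",
)
SPECTATOR_SIDE = (
    "spectators_awareness",
    "spectators_transportation",
    "performers_visibility",
)

@dataclass(frozen=True)
class Scenography:
    """Seven scores on an assumed 0..1 scale (the scale is hypothetical)."""
    name: str
    performers_transportation: float
    ensemble_potential: float
    interaction_spectrum: float
    spectators_visibility: float
    spectators_awareness: float
    spectators_transportation: float
    performers_visibility: float

    def __post_init__(self):
        for dim in PERFORMER_SIDE + SPECTATOR_SIDE:
            value = getattr(self, dim)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{dim} must lie in [0, 1], got {value}")

    def side_mean(self, side):
        """Average score over one side of the dimension space."""
        return sum(getattr(self, dim) for dim in side) / len(side)

    def asymmetry(self):
        """Positive: setup favours the performer; negative: the audience."""
        return self.side_mean(PERFORMER_SIDE) - self.side_mean(SPECTATOR_SIDE)

if __name__ == "__main__":
    # Purely illustrative values for a performer wearing an HMD while the
    # audience watches a 2D screen; not taken from any analysed performance.
    setup = Scenography("hmd_solo_example", 0.9, 0.2, 0.6, 0.0, 0.3, 0.1, 0.7)
    print(round(setup.asymmetry(), 3))
```

A single number necessarily hides which dimensions cause the imbalance, so it complements rather than replaces the plotted dimension space.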


13.4.1 Performers Transportation and Spectators Transportation 


Performers Transportation and Spectators Transportation relate to the manner in 
which performers and spectators are immersed in the virtual musical environment, 
and to the extent to which the virtual and physical spaces intersect in a meaningful 
performative fashion. In particular, they indicate whether the virtual stage is integrated in 
the physical space (or if it is surrounding it) and whether the setup is adequate to 
play/showcase the chosen IVMI. 

They include the following (non-exhaustive) range of technological settings to display 
the VE: 

• a single monoscopic (2D) screen 
• a volumetric display in the centre of a physical stage 
• a mobile/handheld augmented-reality display 
• a stereoscopic screen without and then with head-tracking 
• a CAVE or set of stereoscopic screens 
• an augmented-reality headset 
• a virtual reality headset 


13 From the Lab to the Stage: Immersive Virtual Musical Instruments 401 


Fig. 13.3 Dimension spaces for the analysed stage setups. Each panel plots the seven 
dimensions (Ensemble Potential, Interaction Spectrum, Spectators Awareness, 
Spectators Visibility, Performers Visibility, Performers Transportation, Spectators 
Transportation). Recovered panel titles: The Sound of One Hand (1992), Drile (2011), 
Resilience (2019), The Reggie Watts Experience (2016) 


Beyond visual displays, transportation also applies to auditory feedback (ranging 
from a single monophonic speaker to ambisonics and binaural spatialisation) and to 
haptic feedback, including passive solutions (like the grips on the physical controllers 
required to play the IVMI) as well as proper actuators (ranging from a small 
vibrotactile wearable to exoskeletons for large-scale kinaesthetic feedback). While 
enhancing the feeling of presence within the VE may help, achieving a high 
Performers Transportation requires combining these technologies so that 
the musician can play their IVMI on stage with no extra effort compared to practice 
sessions carried out in a dedicated studio/lab space. Likewise, the ultimate scope of 
the Spectators Transportation dimension is to quantify to what extent the proposed 
musical experience feels real, and whether the display of musicianship is perceived 
as genuine as in the case of a traditional concert setting. It should be clear that the 
transportation dimensions are linked to, but do not overlap with, the constraint of 
immersion. They specifically highlight how the physical and virtual spaces intersect, 
similarly to what was proposed by Benford et al. in the context of shared VEs [6]. 
Although partially accounting for the need to feel present in the VE, these dimensions 
incorporate all the stage performance requirements discussed in Sect. 13.2.1 and 
extend them to the domain of virtual worlds. The very term “transportation” was 
chosen to emphasise the focus on music, which is deemed capable on its own of 
psychologically transporting audiences into narratives, stories and fictional worlds 
[28, 87]. 

Transportation deeply affects both the audience’s and musician’s experience, yet 
in different manners. As a result, its measure tends to be highly asymmetrical. A 
straightforward example may be a scenography where the spectators are wearing 
VR headsets while the performer uses only a monoscopic screen, or the other way 
around, as seen in The Sound of One Hand. Such a difference between the experience 
of the two stakeholders may not always be detrimental. For example, it is hard to 
imagine high Performers Transportation in the absence of interaction. Nonethe- 
less, VEs that do not include interactive 3D objects, but are capable of physically 
reaching and surrounding the audience, appreciably enhance the transportation of the 
spectators [95]. More generally, HMDs, single-user projections, head-tracking and 
active/passive haptic feedback are all elements capable of granting the level of trans- 
portation required by the performer to master the instrument and play it on stage; yet, 
their use may exclude the audience from the VE and cut direct communication with 
the performer, unless the Spectators Transportation level is comparable. This trans- 
lates into strong crossovers between transportation and other dimensions, such as 
Performers Visibility, Spectators Visibility and Spectators Awareness. For instance, 
VR headsets, which likely result in a high transportation value, impose a mediated 
view of musicians and spectators, e.g., with a 2D or 3D live-capture integrated into 
the VE, which in turn may reduce their visibility. On the other hand, with a low 
transportation level for the audience, their awareness might be constrained by the 
impossibility of visually aligning the virtual components of 3D interaction techniques, 
e.g., a virtual ray, with the physical hands of the musician. 


13.4.2 Spectators Awareness 


This dimension describes how well the audience perceives the virtual and physi- 
cal interactions performed by musicians on the virtual instrument, i.e., the relation 
between their gestures, the instrument and the resulting changes in the sound. 

It can be low, for example, when a technique such as virtual ray is used for the 
selection of distant parts of the instrument, but this ray is either not visible at all to 
the audience or not visually co-located with the performer’s hand. It can also be low 
if some physical interactions, e.g., with physical sensors, are not visually reflected 
in the VE and the Spectators Transportation dimension is high. 

The problem of abstract interaction is not unique to IVMIs, yet its occurrence 
is intensified by the use of immersive technologies. Much as in the case of 
IVMI performances, spectators are often incapable of fully grasping the workings of 
non-immersive DMIs, or of perceiving a causal relationship between action and music. 
As a result, the performance runs the risk of becoming opaque [14, 33], even confusing [27, 89], 
reducing the attributed agency [13] for the audience and in turn potentially degrading 
their experience [23]. DMI research suggests that this is due to the very metaphor of 
the instrument [33], as designs favouring intellectual and cognitive skills (e.g., live- 
coding environments, algorithmic devices) prove more prone to trigger abstraction 
compared to those leveraging familiar physical gestures [37]. 

When Spectators Awareness is low, the articulation between perceived manipu- 
lations, i.e., gestures and interaction techniques, and effects, i.e., controlled sound 
parameters, is not visible enough and there is a risk of IVMIs being seen as secretive 
or magical instead of expressive [74]. 

In some cases, a breach into the virtual world is provided by means of screens 
that display the point of view of the performer. This solution may help the audience’s 
understanding of the performance. Nonetheless, as explained in depth in Sect. 13.5, 
much is still left to imagination and interpretation, the reason being that the IVMI 
is made visible but, to the eyes of the spectators, is not immersive (i.e., it does not 
surround the audience, nor the performer). 

In cases where the transportation has a different value for spectators and perform- 
ers or if the interaction techniques are too subtle or too complex, it is also possible 
to provide dedicated visual representations of the interactions for the audience. The 
design of these representations should however be chosen carefully. A correct bal- 
ance needs to be targeted, between too little information, which results in a degraded 
subjective comprehension and potentially degraded experience [23], and too much 
information, which can lead to perceptual and cognitive overload. In the case of indi- 
vidual VR headsets or shared views of the VE for the audience, this level of detail 
can be interactively chosen by spectators [23]. 


13.4.3 Performers Visibility and Spectators Visibility 


Performers Visibility and Spectators Visibility correspond respectively to the level 
of perception of the musician(s) by the audience and to the level of perception of the 
audience by the musician(s). Each may take the following (non-exhaustive) values: 


• not visible at all 
• partially (from behind, from the side, with occluded parts) 
• seeing fully in a simplified manner 
• seeing a detailed 3D reconstruction or facing the physical performer 


This dimension has a strong impact on performer-spectator non-verbal 
communication. 

Many commercial IVMIs and frameworks for immersive performances make use 
of avatars to represent the spectators. These are usually simple and can be chosen 
by users. They will therefore range between medium and medium/high visibility 
levels depending on the level of detail provided on the appearance, behaviour and 
reactions. In these setups, performers often have more detailed or more expressive 
visual representations than spectators. 

In a setup with lower Performers Transportation or Spectators Transportation, 
i.e., where the IVMI is integrated in the physical space, the physical spectators and 
performers can be seen more clearly if they are facing each other, with the instrument 
displayed between them. 


13.4.4 Ensemble Potential 


The Ensemble Potential dimension describes the ability for the scenography to 
accommodate multiple IVMIs or performers. 

It is low when the setup only affords a single performer, for example because a 
head-tracked stereoscopic display is used or because the virtual environment was 
designed to host a single instrument or performer. 

Depending on Performers Transportation and Performers Visibility, a high 
Ensemble Potential means either that the physical space can accommodate mul- 
tiple performers collaborating on the same or with different instruments, or that the 
VE allows for displaying and/or navigating in 3D amongst multiple IVMIs. 

Scenographies with a high Ensemble Potential should also ensure a correct 
co-presence [69], for example with high values of Performers Visibility, and can provide 
access to a variable number of collaboration modes [9]. This dimension also strongly 
relates to the inter-actors and distribution in space dimensions used as part of the 
dimension space proposed by Birnbaum et al. [15]. Inter-actors describe the number 
of musicians while distribution in space specifies how the instrument extends in the 
physical space, ranging from a small device to a networked instrument. Ensemble 
Potential integrates both aspects, since IVMIs can virtually expand to integrate all 
musicians in a single shared VE. 


13.4.5 Interaction Spectrum 


This dimension describes what range of interaction techniques is permitted by an 
IVMI performance setup. These include the three categories of 3D interaction 
techniques described above, to which we add physical manipulations, i.e., musical 
interactions performed in the physical rather than virtual space. 

3D selection techniques enable various types of musical gestures [21]. Although 
they would typically be associated with selection gestures, i.e., picking a component 
of an instrument, they can also serve as excitation, i.e., generating sound, or mod- 
ulation gestures, i.e., changing the properties of the instrument. In fact, entering an 
object with a virtual ray may be used as an instantaneous excitation gesture to trigger 
a note, as done by Maki-Patola et al. [56]. It can also be used for continuous excitation 
when, for example, dragging a virtual cursor across the surface or inside the volume 
of virtual objects. In the context of public performances, selection techniques based 
on virtual rays require a high continuity (from physical hand to virtual ray) to be 
understandable by the audience, while virtual hands/cursors might be more tolerant 
(they are by definition not co-located when doing distant selection) and image-plane 
selection requires no (non-co-located) visual feedback. 
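
The idea of a ray entering an object as an instantaneous excitation gesture can be sketched as an intersection test combined with edge detection, so that a note fires only on the frame where the ray first enters the object. This is a minimal sketch and not the implementation of any cited instrument: the spherical target, the `RayTrigger` class and the per-frame `update` call are assumptions made for the example.

```python
import math

def ray_hits_sphere(origin, direction, centre, radius):
    """True if the half-line from origin along direction meets the sphere.

    Solves |origin + t*direction - centre|^2 = radius^2 for t and accepts
    the hit when the far root is non-negative (the direction need not be
    normalised; an origin inside the sphere counts as a hit).
    """
    ox, oy, oz = (o - c for o, c in zip(origin, centre))
    dx, dy, dz = direction
    a = dx * dx + dy * dy + dz * dz
    if a == 0.0:
        return False  # degenerate ray with no direction
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return False
    t_far = (-b + math.sqrt(disc)) / (2.0 * a)
    return t_far >= 0.0

class RayTrigger:
    """Fires once each time the ray goes from missing to hitting the target."""

    def __init__(self, centre, radius):
        self.centre = centre
        self.radius = radius
        self._was_hit = False

    def update(self, origin, direction):
        hit = ray_hits_sphere(origin, direction, self.centre, self.radius)
        fired = hit and not self._was_hit  # rising edge only
        self._was_hit = hit
        return fired

if __name__ == "__main__":
    pad = RayTrigger(centre=(0.0, 0.0, 5.0), radius=1.0)
    hand = (0.0, 0.0, 0.0)
    # Sweep the ray across the pad: one note per entry, none while dwelling.
    for aim in [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]:
        print(aim, pad.update(hand, aim))
```

Continuous excitation would instead read a value on every frame in which the ray stays inside the object, for instance the position of the hit point on its surface.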

3D manipulation techniques can be used for both excitation and modulation musical 
gestures. Spatial transformations offer not only continuous controls, derived from 
changes in position, orientation and scale, but also discrete controls, derived from 
collisions and intersections, which can be used, e.g., as instantaneous excitation gestures. 
Modification of appearance and shape can also serve as modulation gestures. For 
example, the tunnels of the Drile instrument [10] are 3D sliders which allow musi- 
cians to set the graphical parameters and associated sound parameters of 3D nodes 
of musical hierarchical structures. Virtual sculpting has also been proposed [66] as a way of setting 
musical parameters associated with the shape of a 3D mesh. Manipulation techniques 
can be distant, e.g., with 3D tools, or co-located, e.g., in the case of virtual sculpting. 
In both cases, however, the musician’s actions and the causal link between manipu- 
lation and musical result [13] are made visible to the audience by the visual changes 
in the manipulated objects on which the focus is put. Therefore, the lack of real-virtual 
continuity in manipulation techniques might not affect the spectator experience as 
much as in other interaction categories. 

3D navigation techniques can be used for most types of musical controls. Processes 
and parameters can be discretely selected before modification by entering associated 
volumes, such as the virtual rooms used in Drile [10]. Modulation of musical param- 
eters can be achieved through displacement in parameter spaces, either continuous 
with general movement techniques or discrete with targeted movements. In the same 
manner, excitation gestures can be achieved by mapping the relative position of vir- 
tual objects to the volume of associated sound processes, as done in Versum [3]. 
The impact of real-virtual continuity on the audience experience of 3D navigation 
depends very much on the granularity of the musician’s position mapped to musical 
parameters. If the mapping is done according to the musician’s movements within 
the physically navigable space, meaning that the user can physically walk to move 
through it, the audience’s understanding of the musician’s impact on the sound will 
require a high level of real-virtual continuity. However, if the navigation moves this 
physically anchored space in the VE, then the performed action is directly visible 
to all spectators from changes in the environment alone, and real-virtual continuity is 
not as necessary. 
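
The two navigation cases can be made concrete with a small sketch in which musical processes are selected by entering volumes, and the musician's virtual position is the sum of a physical position (walking inside the tracked area) and an anchor (where that area currently sits in the VE). Everything here is an assumption for illustration, including the axis-aligned `Room` boxes, the anchor model and the process names; it does not reproduce how Drile implements its rooms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Room:
    """Axis-aligned volume associated with one sound process (hypothetical)."""
    name: str
    lo: tuple
    hi: tuple

    def contains(self, point):
        return all(l <= x <= h for l, x, h in zip(self.lo, point, self.hi))

def virtual_position(anchor, physical):
    """Virtual position = anchor of the tracked area + position inside it."""
    return tuple(a + p for a, p in zip(anchor, physical))

def selected_process(rooms, position):
    """Name of the first room containing the position, or None."""
    for room in rooms:
        if room.contains(position):
            return room.name
    return None

if __name__ == "__main__":
    rooms = [
        Room("filter", (0.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
        Room("delay", (3.0, 0.0, 0.0), (5.0, 2.0, 2.0)),
    ]
    standing = (1.0, 1.0, 1.0)  # the musician does not move physically
    # Moving the anchor changes the selection while the body stays still,
    # so spectators read the action from the environment alone.
    for anchor in [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]:
        print(anchor, selected_process(rooms, virtual_position(anchor, standing)))
```

When the anchor is fixed and only the `physical` term changes, the selection depends on where the musician stands on stage, which is the case that calls for high real-virtual continuity.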


406 V. Zappi et al. 


Physical interactions constitute another category of interaction which can be made 
available by a specific scenography, and corresponds to controls performed in the 
physical space, e.g., on a control surface or an acoustic instrument. In order not 
to degrade the Spectators Awareness, these controls also need to be represented in 
the VE using changes in the performer’s avatar or in the instrument appearance for 
example. Physical controllers and instruments can also be captured and rendered 
inside the VE. 


13.5 Case Study: Analysis of IVMI Performances 


In this section we use the seven IVMI scenography dimensions to analyse differ- 
ent performances and discuss their setups. This allows for practical observations on 
scenographies and their possible variations. The performances are introduced 
chronologically: the section tries to give a sense of evolution and change of the medium 
over time, in terms of ideas, implementation, technology and diffusion. Performances 
have been selected giving precedence to pioneering solutions, and preferring 
well-documented acts, both in the literature and on the web at large. 

A visual representation of the analysis is given in Fig. 13.3, in the form of a 
dimension space that provides a quick overview of each performance’s properties. 
As mentioned in Sect. 13.4, it is structured both vertically and horizontally in order 
to provide a quick idea of the distribution of a scenography, between interaction and 
immersion and between spectators and performers. 


13.5.1 Approaching a Performance Analysis 


The analyses featured throughout this section start by dissecting the essential aspects 
of each performance. The main objective at the beginning of the process is to isolate 
the atomic components defining the stage setup, the IVMI, its use and the expected 
behaviour of the performer(s) and audience. If the venue has some other peculiarities, 
it is also helpful to address them at this stage. As an example, the following are all 
valid questions which arise when starting the analysis process: Are there HMDs 
involved? Who is wearing them: the audience, the performer(s) or both? Is there a 
screen dedicated to the audience, how is it oriented? Is it hiding the performer from 
the audience, or vice versa? Beyond the visual aspect, other important questions 
inquire about the performance itself: How many performers are playing? How do 
the virtual and real instruments used on stage work? Are they easy to understand, or 
hindered by some design choice or technical limitation? Finally, our focus may shift 
to the location: Is everyone in the same physical location, or does the performance 
setup involve some form of telepresence? How good is the continuity between virtual 
and real elements on stage? What about the venue, seen from the performers’ point 
of view? Therefore, the first part of the analysis consists of making a list of all 
the prominent bits which make the performance that performance: the resulting 
summary is not necessarily a technical survey. On the contrary, it can be interpreted 
as a synopsis of the IVMI and its stage setup, from where the actual constraints will 
emerge. The outcome of this summary is a quick reference to consult when evaluating 
the seven dimensions. 

Once the fundamental pieces of the performances (and their setups) have been 
identified and summarised, it is possible to start discussing how they fit within the 
dimension space as a whole. By preference, the analyses presented here start by 
addressing transportation. Performers Transportation and Spectators Transportation act 
as a solid ground for building the rest of the evaluation: as specified in Sect. 13.4.1, 
they can easily influence other dimensions. They encompass multiple feedback chan- 
nels (visual, auditory and haptic), even though they tend to gravitate towards visual 
feedback, which typically has a heavy impact on presence. Available technologies 
also contribute to this bias towards visual immersion. Nonetheless, other channels 
should be considered carefully when investigating these dimensions. After evaluat- 
ing transportation, it is reasonable to consider awareness and visibility dimensions, 
which also depend on multimodal feedback. Finally, the remaining dimensions can 
be addressed prioritising their prominence and importance within the performance. 

Plotting the dimensions is a process involving a subjective judgement, especially 
when it comes to choosing the exact values used to generate the dimension space. 
Nonetheless, the seven dimensions are designed to highlight the asymmetries and 
relationships existing between the different aspects of a performance. Such con- 
straints exist independently from the chosen numerical values: this is where a care- 
ful analysis potentially moves from being mostly subjective to being descriptive of a 
set of existing relationships. Certain technological setups are currently intrinsically 
incapable of providing, for example, high transportation both on and off stage, as-is. 
HMDs tend to hinder visibility, projected screens can break continuity between vir- 
tual and real elements, thus affecting transportation, and so on. So, the descriptive 
viewpoint provided by an accurate dimension space is of great interest despite the 
exact values used to create the plot. A reliable IVMI stage overview can be used not 
only to understand and analyse an already staged performance, but also to monitor 
and guide the design of a new one. The performance designer(s) could address early 
on some limitations, e.g., if Spectators Visibility and Performers Transportation are 
both considered important for a certain performance, a real-time video or point cloud 
representing the audience could be used to improve Spectators Visibility when the 
performer is wearing an HMD while maintaining high Performers Transportation. 
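
To show that the shape of the plot carries the relationships while the exact numbers stay subjective, the outline of a dimension space can be computed directly from seven scores. This is a sketch under stated assumptions: the 0 to 1 scale and the dictionary keys are hypothetical, and the clockwise axis order starting from Ensemble Potential at the top is only inferred from the figure layout, with the spectator-side dimensions on the right and the performer-side ones on the left.

```python
import math

# Assumed clockwise axis order, starting at the top of the diagram.
AXES = [
    "ensemble_potential",
    "spectators_awareness",
    "performers_visibility",
    "spectators_transportation",
    "performers_transportation",
    "spectators_visibility",
    "interaction_spectrum",
]

def radar_vertices(values, axes=AXES):
    """Return one (x, y) vertex per axis for a radar-style outline.

    values maps each axis name to a score used as the radius; the first
    axis points straight up and the following ones proceed clockwise.
    """
    n = len(axes)
    vertices = []
    for i, name in enumerate(axes):
        angle = math.pi / 2 - 2 * math.pi * i / n
        radius = values[name]
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices

if __name__ == "__main__":
    # Illustrative scores only; not taken from any analysed performance.
    scores = dict(zip(AXES, [0.2, 0.3, 0.7, 0.1, 0.9, 0.0, 0.6]))
    for name, (x, y) in zip(AXES, radar_vertices(scores)):
        print(f"{name:26s} {x:6.2f} {y:6.2f}")
```

The resulting list of points can be handed to any drawing library as a closed polygon; a lopsided outline then signals an asymmetry whichever numerical scale the analyst has chosen.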

After having carefully populated the dimension space, an additional final step is 
that of finding a one-sentence description of the analysed performance. In this section 
these short descriptions can be found right at the beginning of each analysis. This 
final touch has at least two objectives: it implies a review of the analysis process and 
it guides the future reader by highlighting the performance’s core values. 


13.5.2 The Sound of One Hand 


Pioneering performer’s immersion for fine control. Jaron Lanier’s The Sound of One 
Hand [49] was performed for the first time at SIGGRAPH in 1992. 

Multiple virtual instruments were used during the performance, with Lanier playing 
them in turn. Sounds and notes were generated by hand movements, which were 
transmitted to the instruments using a Data Glove. The Data Glove was also used 
by the musician to move and reach the instruments, which were scattered all around 
the VE. Instruments are described by the musician as autonomous and sometimes 
fighting back. Lanier talked in detail about these instruments, addressing how they 
are created and also how they take inspiration, visually and sonically, from 
real-world instruments (see footnote 7). A head-mounted display (HMD) was used by 
the artist in order to immerse himself in the VE, and therefore access the virtual 
instruments. This created a setting where the musician could clearly be seen by the 
audience throughout the whole performance. On the other hand, it was impossible for 
the performer to see the audience. On stage, next to the performer, a screen was used 
to display a 2D projection of his point of view. This granted the spectators access to the VE. 

The primary dimension of this stage setup is Performers Transportation: the use 
of the HMD allowed the musician to perceive a consistent world all around him 
and to have access to fine 3D controls. However, the use of the HMD leads to the 
absence of Spectators Visibility. Conversely, Performers Visibility is quite high: 
Lanier played on stage, right in front of the audience, but he was also free to move 
and rotate, at times partially hiding his gestures. Spectators Awareness is limited since 
the VE and the musician were perceived by the audience as two completely separate 
elements, the former projected onto a screen, the latter moving on the physical stage, 
with no continuity between the two. Furthermore, the screen projection was 2D and 
it displayed the musician’s point of view, resulting in an extremely low level of 
Spectators Transportation. 

The Interaction Spectrum includes 3D manipulation and navigation. Lanier opted 
for a point-flying navigation technique, a choice motivated by the artist’s will to have 
an unconstrained and skilful way to explore the VE. 

The scenographic level of this pioneering setup is understandably constrained, 
and it mainly focuses on the musician and his interaction with the VE. About the 
instruments, Lanier himself states that “They emerged from a creative process I 
cannot fully explain”, and describes them as not immediately understandable, and 
also difficult to play. However, showing spectators a 3D projection aligned with the 
physical position of the musician on stage would remarkably enhance the audience’s 
experience, providing immersion and increasing gesture continuity. 

An interesting note: according to Lanier’s impressions, the asymmetry between 
performers and audience visibility resulted in him feeling vulnerable on stage, as 
opposed to what might happen when using rare and expensive technology for a 
performance. So, for the musician, adding this combination of dimensions generated 
a “more authentic setting for music”. 

7 http://www.jaronlanier.com/instruments.html 


13.5.3 Virtual_Real 


Intense audience experience. The Virtual_Real performance [95] was born from a 
collaboration between Victor Zappi and the electronic composer USELESS_IDEA. 
It took place three times in Genoa, in 2010. 

The performance was set up inside a laboratory room, which acted as an inti- 
mate venue. At the centre of the stage, the musician could play standard hardware 
controllers available in front of him. A single screen was positioned on stage, at his 
back. The screen displayed to the audience stereoscopic images of VEs populated by 
3D objects, acting as both instruments and visuals. Thanks to optical motion capture 
the performer could move, touch or morph these virtual objects, in order to control 
audio effects. Thus, the setup allowed the musician to play both standard hardware 
controllers and non-immersive virtual instruments in front of the audience. 3D visu- 
als and control algorithms were designed, tested and modified based on the artist’s 
input and ideas. USELESS_IDEA played five tracks specifically composed for the 
event, each associated with a different 3D choreography. 

Hardware controllers used by the performer included a laptop, a MIDI controller 
and a small mixer. The artist’s dominant hand was tracked using passive reflective 
markers, allowing him to trigger interactions with the VE. The immersive content 
was designed to be experienced by the audience: despite the impossibility of providing 
head-tracking for each spectator, the proportions between the projected screen size 
and the room size allowed the small audiences of nine spectators to enjoy a shared 
viewpoint, with no significant visual distortions. The audience could thus experience 
a stage where the performer, real items and virtual elements shared the same space 
(Fig. 13.4). 

This performance is strongly focused on providing an intense audience experience. 
As a consequence, transportation is highly asymmetrical, with immersion 
affecting the spectators exclusively. The VE and its virtual instruments are perceived 
by the audience as coherently superimposed with the physical stage. This leads to a 
high Performers Visibility. Performers Transportation is absent since the musician 
faces the audience and not the screen, while Spectators Visibility is high. Spectators 
Awareness is positively influenced by the possibility of clearly seeing the performer 
interacting with both real and virtual instruments. The musician’s physical interfaces 
provide the same interaction transparency which could be expected in a traditional 
electronic music performance. Virtual instruments were coherently rendered with 
the audience’s point of view, and the performer could be seen manipulating them. 
The sonic and visual results of such interaction were designed to be easily perceived. 
The Interaction Spectrum mainly included 3D manipulation techniques, with the 
performer moving and dragging objects around the VE scenes. 

Fig. 13.4 USELESS_IDEA performing Virtual_Real, 2010. The shot frames two spectators wearing 
stereoscopic goggles, required to fully appreciate the hybrid virtual/physical stage 


This single-screen setup can create a strong involvement in the audience: virtual 
choreographies can be really convincing, and non-verbal communication with the 
performer can be really close to what would happen on a traditional stage. However, 
such an extremely audience-centric setup makes it impossible for the musician to use 
complex and potentially more expressive 3D interaction paradigms, thus limiting the 
possibilities of the virtual instrument. Slight setup modifications could generate a dual 
experience, in which the screen projection is dedicated to the performer, completely 
changing the scenographic outcome. The audience would no longer enjoy the perfect 
virtual/real environments consistency, while the musician would be immersed in the 
instrument, allowing for fine audio control. 


13.5.4 Drile 


Immersion for both ends. This performance was executed by Florent Berthaut in 
Bordeaux, 2011. The Drile instrument [10] used throughout the performance allows 
a musician wearing stereoscopic goggles to execute live-looping in a 3D immersive 
environment. The performer uses handheld devices with pressure sensors in order to 
reach, excite and modulate the musical objects populating the environment. These 
objects are associated with the nodes of hierarchical live-looping trees, and their 
manipulation allows the musician to create and handle loops. Virtual rays are used 
to select and interact with the virtual objects. 

Drile was shown on stage, thanks to stereoscopic projections. Two juxtaposed screens 
were positioned on stage, with an angle between them. One screen was 
exclusively facing the performer, sideways. The second screen was rotated so that 
it could face the audience. This arrangement had the screens defining an enclosed 
volume on stage. Therefore, both audience and performer perceived Drile as an 
instrument “contained” inside this volume (Fig. 13.5). A correct perspective was 
granted to the performer by means of head-tracking, while a shared viewpoint was 
used to display the stereoscopic content on the audience screen. 

Fig. 13.5 Florent Berthaut performing with the IVMI Drile, 2011. The shot is taken from the seat 
area and shows the 3D musical environment being pierced by the green virtual rays cast by the 
performer 

This performance gives a highly symmetrical experience to the audience and the 
performer. Transportation is medium, since both parts can properly perceive the vir- 
tual instrument and the real stage while the virtual space is literally contained within 
the physical space. Spectators Visibility and Performers Visibility are both quite high, 
since musician and spectators could directly see each other. Spectators Awareness is 
good, but hindered by the distance between the performer and the screen: virtual 
rays shown within the VE indicated which virtual objects the musician was manip- 
ulating, yet the instrument was operated standing one or more meters away from the 
screen. This distance breaks the continuity between the performer’s hands and the 
virtual rays. The Interaction Spectrum relies on virtual rays for the selection and 
manipulation of the 3D musical elements, but without physical manipulations. 

This performance setup provides proper immersion for the audience and the per- 
former, resulting in a great scenographic outcome and potential. Having a correct 
perspective for both parts allows the musician to have fine control of the instrument, 
and the audience to have a meaningful understanding of his actions. An alternative 
version of this setup could use a bigger, transparent screen dedicated to the audience. 
This screen would be placed between the spectators and the musician, making it possible to 
overcome the absence of continuity between the performer’s hands and the virtual 
rays shown in the VE. Since the musician’s head and hands are tracked, additional 
visual effects and feedback solutions dedicated to the audience could be designed. 
This could, however, negatively impact Performers Visibility, and should be carefully 
implemented to avoid a negative outcome on Spectators Awareness. 


13.5.5 The Reggie Watts Experience 


A truly shared experience. This setup is based on the possibilities offered by social 
VR platforms. Users wearing a headset can share a virtual space, and interact with 
each other through 3D avatars. The performer Reggie Watts has been a recurring host 
of shows taking place specifically within the AltspaceVR platform, since 2016. His 
shows have been labelled as The Reggie Watts Experience and the performer keeps 
exploring the possibilities given by the format to this date. 

Both the audience and the performer wear an HMD, which allows them to share 
the same virtual space and see each other. Reggie Watts’s movements on stage can be 
tracked, thanks to full-body motion capture. He can use a microphone, controllers 
and effects, close to what he might do on a real stage. This kind of setup allows him 
to dance in front of the audience, see and address participants and move around the 
entire venue. The appearance of the avatars, venue and visual effects used throughout 
performances is designed to match the overall stylised aesthetic of AltspaceVR. 
Regarding the venues, different virtual spaces have been created and used, thanks to 
the possibilities given by the platform. Sometimes, visual effects can be seen, such as 
virtual fireworks, and simple, moving shapes. Tracking is available for the audience 
as well, based on the setup they have access to. 

While wearing his HMD and tracking system, the performer can still interact 
with his own instrumentation, which is sometimes represented on the virtual stage 
by simple 3D models. AltspaceVR provides a tool for hosting multiple 
instances of the same venue so that a very large number of spectators can participate at 
the same time. Each instance can host ten participants, meaning that each member 
of the audience can be close to the stage. Spectators and the performer only see a 
limited part of the total number of participants currently present at the virtual venue. 
The completely virtual environment allows Reggie’s voice and instruments to be 
spatialised so that as he moves throughout the venue, it is clear to the audience where 
to look for him. 
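
As a rough illustration of the instancing arithmetic described above, the following Python sketch computes how many venue instances a given audience would require and groups spectators accordingly. The capacity of ten is taken from the text; the function names and the fill-in-order assignment policy are our own assumptions and do not describe the actual behaviour or API of AltspaceVR.

```python
from typing import Iterable, List


def instances_needed(spectators: int, capacity: int = 10) -> int:
    """Return the number of venue instances needed so that none exceeds `capacity`."""
    if spectators < 0 or capacity <= 0:
        raise ValueError("spectators must be >= 0 and capacity must be > 0")
    # Integer ceiling division: every started group of `capacity` needs an instance.
    return (spectators + capacity - 1) // capacity


def assign_instances(spectator_ids: Iterable[int], capacity: int = 10) -> List[List[int]]:
    """Group spectators into instances in arrival order (hypothetical policy)."""
    ids = list(spectator_ids)
    return [ids[i:i + capacity] for i in range(0, len(ids), capacity)]


groups = assign_instances(range(25))
```

Under these assumptions, an audience of 25 is split into three instances, the last of which is only half full; each spectator, like the performer, sees only the members of their own instance.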

These performances focus on the idea of a shared space, and transportation is 
strongly symmetrical: both the audience and the performer are immersed in the VE as 
if they were physically present at the same venue. Performers Visibility is high, even 
when a huge audience is participating, thanks to the possibility of having multiple 
instances of the same performance, each hosting a limited number of spectators. 
Spectators Visibility is high, but only for those spectators who are in the same 
venue instance as the performer. So, from the performer’s point of view, spectators 
are either really close and visible or not present at all. Spectators Awareness is 
limited to what can be understood from Reggie’s limbs and body movements. Thus, 
audience experience mainly relies on his voice, music, posture and dancing. The 
presence of 3D models of his gear mitigates the limited awareness, for those cases 
in which the performer interacts with physical instruments. His avatar can in fact 
be seen bending over the controllers, making it easier to understand his posture in 
those particular moments. Regarding the Interaction Spectrum, virtual instruments 
are absent. 3D navigation is possible for the performer and affects the spatialisation 
of sound, but it’s not used to interact with virtual instruments. 

This performance allows a direct communication between the audience and the 
performer. Reggie can address his spectators and interact with them. The possibil- 
ity of seeing the performer moving, dancing and posing in the VE could be further 
explored, though. No virtual instruments are present, so the potential of the setup 
used to stage this performance is not completely explored yet. Virtual instruments 
could be added, which might be a way to create an even more compelling experi- 
ence. Both the audience and Reggie share the same environment, and no perspective 
issues are present: this can overcome part of the limitations seen in more asymmet- 
rical performances, and would allow a less constrained interaction design for virtual 
instruments. Nonetheless, the immediacy of having only the performer on stage has 
its own advantages: going to the extreme opposite could be detrimental to Spectators 
Awareness, and also negatively affect the spontaneous feel of the performance. 

The Reggie Watts experience is part of a set of immersive performances and virtual 
instruments which are exploiting the growing diffusion of consumer virtual reality 
setups. A variety of platforms is being developed, each addressing different scenarios: 
immersive music making, remote participation in live events, VR dance clubs and 
so on. Electronauts is a VR instrument for beat making and jamming. The creators 
also showcased an augmented/mixed reality video of a session where a performer 
playing the Electronauts instrument jams along with other musicians (guitar, sax 
and drums). AltspaceVR is providing a platform for performers like Reggie Watts 
to create shared musical experiences, and other companies aim to provide similar 
setups. MelodyVR allows users to capture and share immersive videos from live concerts, 
which can be experienced on a VR HMD. Online multiplayer videogames such as 
Fortnite have been used to host musical performances. Even if not immersive for 
audience or musicians, such endeavours show a growing interest in the exploration 
of novel possibilities in the field of virtual musical performances. 


13.5.6 Resilience 


A laptop orchestra with a VR Conductor. This performance is designed for a laptop 
orchestra and one VR performer/conductor. Resilience [2] was performed in June 
2019, at the 2019: A SLOrk Odyssey concert at the Bing Concert Hall of Stanford 
University. 

Fig. 13.6 Resilience immersive musical performance, 2019. The conductor stands at the centre of 
the stage, wearing an HMD and leading the orchestra via both hand gestures and 3D interaction. 
Image courtesy of Ge Wang, Stanford Laptop Orchestra 


The VR performer is at the centre of the stage, wearing an HMD and acting 
as a conductor. Surrounding him, eight performers are positioned on two separate 
rings. The performer’s hand movements are tracked by handheld motion controllers, 
while the rest of the ensemble has access to tether controllers. Each performer has 
a laptop and speaker array. The VR performer is facing away from the audience, in 
the direction of an oversized projection screen. Thus, the audience has a view of the 
conductor, the orchestra and a 2D projection of the environment experienced by the 
VR performer (Fig. 13.6). 

The performance was structured in three movements, with the VR conductor cue- 
ing the orchestra throughout the piece with his body movements. By using motion 
controllers, he sometimes also triggered flashes of lightning. The orchestra mem- 
bers used tether controllers to affect the movements of virtual seedlings and their 
visual aspect, and excite synthesised sounds. Both the way the performer’s movements 
were acted out and the timbre of the synthesised sounds changed with each movement 
of the piece. At times, the whole ensemble also acted as a single meta-instrument, 
performing wave gestures which were paired with movements of a wind timbre across 
the ensemble. When this happened, the virtual seedlings changed their direction 
accordingly. During the entire performance, the point of view of the projection shown to the 
audience was curated by the head movements of the conductor, which the creators 
have thoroughly evaluated and rehearsed. The same 2D projection was rendered on 
small monitors available to the orchestra performers. 

In terms of fruition, this performance provides different experiences to the audi- 
ence, conductor and orchestra. Performers Transportation is high for the conductor, 
who is immersed in the VR environment, while the orchestra only experiences the VE 
on a small monitor. The conductor entirely misses the real stage, which conversely is 
the main space experienced by the other performers. Because of this, overall Perform- 
ers Transportation can be considered of medium level. Spectators Transportation is 
limited: the stage is clearly in front of the spectators, while the virtual environment 
is displayed on a screen from the conductor’s point of view. Performers Visibility is 
high for the audience and the orchestra, while the conductor can only perceive the 
virtual environment. Spectators Visibility is high for the orchestra, and absent for 
the performer. Spectators Awareness is positively influenced by the clearly visible 
choreographed movements of conductor and orchestra, which are affecting the sonic 
outcome of the piece and the visuals of the virtual environment, thanks to the tethered 
controllers. The Ensemble Potential of this performance is good, as the piece is designed 
for a conductor and orchestra. Nonetheless, co-presence is limited for the conductor, 
who cannot perceive their own orchestra other than sonically. 

Resilience could be described as a carefully planned laptop orchestra piece with 
live visuals, featuring the addition of a conductor immersed in a VR environment. 
Audience access to the VE is provided through a 2D projection, curated by the 
conductor in real time. This can be used as an expressive channel, at the expense 
of audience immersion, which could otherwise be improved by introducing a stereo 
projection with shared point of view (see Virtual_Real and Drile performances). 
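
The symmetry judgements made in the analyses of Virtual_Real and Drile can be made concrete with a small data structure. The Python sketch below is only an illustration: the four-step ordinal scale, the class and field names, and the asymmetry measure are our own assumptions, and the levels are our reading of the qualitative descriptions above (in particular, treating Spectators Transportation in Virtual_Real as high). Only four of the dimensions are encoded.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical ordinal scale for a dimension (not defined in the chapter)."""
    ABSENT = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Scenography:
    name: str
    performers_transportation: Level
    spectators_transportation: Level
    performers_visibility: Level
    spectators_visibility: Level

    def transportation_asymmetry(self) -> int:
        """0 means symmetrical; larger values mean one side is far more immersed."""
        return abs(self.performers_transportation - self.spectators_transportation)

    def visibility_asymmetry(self) -> int:
        """0 means performers and spectators can see each other equally well."""
        return abs(self.performers_visibility - self.spectators_visibility)


# Levels transcribed from the qualitative analyses (our interpretation).
VIRTUAL_REAL = Scenography("Virtual_Real", Level.ABSENT, Level.HIGH, Level.HIGH, Level.HIGH)
DRILE = Scenography("Drile", Level.MEDIUM, Level.MEDIUM, Level.HIGH, Level.HIGH)

for show in (VIRTUAL_REAL, DRILE):
    print(show.name, show.transportation_asymmetry(), show.visibility_asymmetry())
```

With these assumed values, Virtual_Real scores the maximum transportation asymmetry and Drile scores none, which mirrors the "highly asymmetrical" and "highly symmetrical" descriptions given in the text. Such numbers are a bookkeeping aid only and do not replace the qualitative analysis.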


13.6 Towards the Design of Novel Scenographies 


The rapid growth of consumer and professional VR technologies is offering new 
interesting perspectives to IVMI designers and performers. As hinted by the analyses 
presented in the previous section, a fair number of stage setups have been explored 
over the last 30 years, leading to extremely different experiences and related dimen- 
sional spaces. Nonetheless, there is still much to experiment with and discover. Every 
year, technologies that once were seen only in research laboratories or during spe- 
cialised scientific events become available in public and entertainment spaces, and 
some even populate the shelves of electronics stores. Some examples are the large 
immersive multi-projection systems now found in several museums and performance 
spaces, as well as the first wave of see-through headsets that hit the market just a 
few years ago. While facilitating the design of more advanced and more daring vir- 
tual experiences, these technologies embrace specifications that make them more 
and more compatible with digital media and, in particular, audio standards.⁸ As a 
result, the creative horizons of VR musicians keep expanding, thrust by the embed- 
ding of devices, materials and arrangements that had never before been available to 
convey musical expression in live settings. 


8 To corroborate this argument, we would like to point the reader to the beautiful Alt F performance 
showcased at NIME 2021, during which DMIs well-known to old-time attendees of the conference 
were played for the first time in a fully immersive online environment! 
In this section, we take the liberty to suggest three solutions that propose unique 
takes on the virtual musical experience and that, to our knowledge, are yet to be 
explored. It is important to remark that we are not going to describe scenographies 
per se, though. The technological and spatial composition of a virtual musical perfor- 
mance depends necessarily on both the instrument and the stage (Sect. 13.1.1), and 
refers to a precise instance (or a series of instances) of the show. By contrast, we 
are now about to discuss the use of immersive technologies in precise stage arrangements 
that encompass performers and spectators, yet without focusing on any specific IVMI 
or performance. In this context, the dimensions allow us to carry out an analysis of 
the potential of these stage setups, in terms of their ability to accommodate various 
categories of musical instruments and to create impactful scenographies for/with 
them. At this point, it should also be clear to the reader that no solution is perfect 
and the setups we are about to introduce are no exception. 


13.6.1 Co-located Antithetical Immersion 


We start with something relatively easy to achieve, at least from a purely techno- 
logical perspective. Let us consider what appear to be two antithetical immersive 
solutions, in particular those used in The Sound of One Hand and Virtual_Real 
(Sects. 13.5.2 and 13.5.3). The former features HMD and Data Glove to give the per- 
former full access to the VE (high Performers Transportation and wide Interaction 
Spectrum), though limiting Spectators Transportation and Spectators Awareness; the 
latter leverages exo-centric 3D projections that convincingly merge virtual and phys- 
ical world to the eyes of the audience (remarkable Spectators Transportation and 
Spectators Awareness), at the cost of Performers Transportation and Interaction 
Spectrum. Although often used separately (e.g., [36, 90]), these immersive setups 
can be combined to balance out most of their individual shortcomings. 

In practical terms, what we envision is a stage where headsets are used by 
performers⁹ and stereoscopic projections are designed for the audience. This scenario 
co-locates on the same physical stage immersive technologies that differ in structure 
and target, making it possible to display the VE and interaction from two distinct perspectives 
simultaneously: the one of the performer (as rendered on the HMD) and the one 
of the audience (as rendered on the screen). This makes it possible to reach high values of 
transportation for both performer and audience, and strong Spectators Awareness. 
Furthermore, such a setup provides access to all 3D interaction techniques, making 
it compliant with a variety of IVMIs and leading to the design of scenographies 
characterised by a broad Interaction Spectrum. Unfortunately, the use of headsets 
makes the visibility dimensions quite asymmetrical, although the setup scales quite well to 
ensemble performances. 


9 As discussed in Sect. 13.3.1, headsets combine an HMD with tracking and input devices. 


13.6.2 Augmented Workspace and Spatial Paradox 


The second setup that we propose promotes a rather “unorthodox” experience of 
the space that performer and spectators share. Right at the beginning of this chapter 
(Sect. 2), we mentioned the possibility to play with the scale of the VE in paradoxical 
ways, the most common example being virtual instrumentation that exceeds the 
physical size of the stage (e.g., [49]). Now we take a step in the opposite direction. 
We describe a solution to make music with a virtual world in miniature, that can fit 
the hands of the performer, but is still capable of surrounding a full audience! 

The inspiration for this concept comes from artist Hicham Berrada and his work 
Présage. Berrada filmed a 360-degree view of the inside of a small water tank, while pouring 
coloured chemicals into it; he then scaled up the video to fit a large multi-projection 
installation, where spectators could experience a stroll at the bottom of the lively 
tank. Our take on this setup replaces the water tank with a medium to small-sized VE, 
populated with musical objects and embedded in the performer’s workspace via an 
AR headset (like the Microsoft HoloLens). In a separate room, the audience is hosted 
inside an immersive stereo-projection system; here, the same VE is scaled up by two 
or more orders of magnitude and rendered as if the seat area/parterre were inside of 
it, facing the performer. Furthermore, stereo-cameras can be easily installed on both 
sides of the setup so that the AR workspace could include a miniature volumetric 
render of the audience and the stereo-projections could showcase the titanic body 
and gestures of the performer.¹⁰ The result is a paradox, a non-existing shared space 
where musicians, spectators and virtual objects can be huge or tiny, depending on 
the beholder. 
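
The scale relationship between the two rooms can be sketched as a single uniform scaling about a shared origin. The following Python fragment is a minimal illustration under our own assumptions: the class name, the choice of a common origin and the factor of 100 (two orders of magnitude) are hypothetical, and a real setup would also need to handle rotation, tracking offsets and calibration.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ScaleLink:
    """Maps points between the miniature AR workspace and the audience-scale VE."""
    factor: float  # audience units per workspace unit, e.g. 100.0

    def to_audience(self, p: Vec3) -> Vec3:
        """Enlarge a workspace point so that it surrounds the seat area."""
        return (p[0] * self.factor, p[1] * self.factor, p[2] * self.factor)

    def to_workspace(self, p: Vec3) -> Vec3:
        """Shrink an audience-space point (e.g. a spectator) into the workspace."""
        return (p[0] / self.factor, p[1] / self.factor, p[2] / self.factor)


link = ScaleLink(factor=100.0)
# A musical object 25 cm from the workspace origin, expressed in metres.
object_in_workspace = (0.25, 0.5, 0.0)
object_for_audience = link.to_audience(object_in_workspace)
# A spectator seated 8 m from the origin appears 8 cm away to the performer.
spectator_in_workspace = link.to_workspace((8.0, 0.0, 0.0))
```

The same factor applied in both directions is what makes the performer appear titanic to the audience and the audience appear miniature to the performer.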

This unusual setup may support the design of scenographies that excel in most 
dimensions, in particular those pertaining to transportation and visibility. The main 
drawbacks, however, may come in terms of low Ensemble Potential and narrow Interaction 
Spectrum. Indeed, sharing the AR workspace between more than two performers 
may prove problematic, while the overall spatial design aligns well only with specific 
interaction techniques and IVMI metaphors. 


13.6.3 Double-Sided Virtual World 


We conclude our review of proposed stage setups with a technically challenging yet 
visually impressive solution. The aim of this last entry is to employ a single VR/AR 
technology to immerse both performer and spectators, while they are physically 
present in the same venue. By leveraging the Pepper’s ghost effect, an acrylic semi- 
transparent screen can be set up to obtain a double-sided reflective surface that splits 


10 The seat area may even be virtually embedded in one of the movable parts of the instrument, 
turning the execution of a piece into a lively ride for the audience; this type of extremely embodied 
musical experience would likely cause a high occurrence of motion sickness, but who are we to 
define the boundaries of artistic expression? 
the stage from the seat area and forms two distinct windows on the virtual world: 
one for the musician and one for the audience. The screen has to be installed at 
the edge of the stage with a 45 degree horizontal tilt so that one of its sides leans 
towards the spectators. Then, two projection surfaces are placed above and below it; 
projections reaching the top surface are reflected on the audience’s side of the semi- 
transparent screen, while projections directed to the bottom surface are reflected on 
the musician’s side. Such a setup minimises interference between the two reflections 
so that both performer and audience can use the screen to have a clean stereo-view 
of the VE, from their own perspective. The way the VE is rendered on the two sides 
may even differ in level of details or content! Furthermore, portions of the screen not 
reflecting any light maintain their see-through nature. This makes it possible to include physical 
props within the VE or, vice versa, to augment traditional musical gear with virtual 
widgets. 
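
The geometry behind the two reflections can be checked with the standard mirror formula r = d - 2(d · n)n, where d is the direction of the incoming light and n is the unit normal of the screen. The Python sketch below is an idealised illustration, not a model of the actual rig: the axis convention, the orientation chosen for the normal and the perfectly vertical light paths are all our own assumptions, picked so that the result matches the description above.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def reflect(d: Vec3, n: Vec3) -> Vec3:
    """Mirror direction d about a plane with unit normal n: r = d - 2 (d . n) n."""
    dot = sum(a * b for a, b in zip(d, n))
    return (d[0] - 2.0 * dot * n[0], d[1] - 2.0 * dot * n[1], d[2] - 2.0 * dot * n[2])


# Assumed axes: +x points from the stage towards the seats, +y points up.
s = 1.0 / math.sqrt(2.0)
# Assumed 45-degree tilt: the upper face of the screen looks towards the audience.
SCREEN_NORMAL: Vec3 = (s, s, 0.0)

# Light leaving the top projection surface travels downwards onto the screen.
from_top = reflect((0.0, -1.0, 0.0), SCREEN_NORMAL)
# Light leaving the bottom projection surface travels upwards onto the screen.
from_bottom = reflect((0.0, 1.0, 0.0), SCREEN_NORMAL)
```

In this idealised model, the image from the top surface leaves the screen horizontally towards the seats, while the image from the bottom surface leaves it horizontally towards the stage, so each side sees only its own reflection.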

In our 2014 work [12], we described a prototype scenography based on this double- 
sided setup, built and tested in a VR laboratory. Despite the obvious advantages of 
working in a controlled setting as opposed to an actual stage, that experience high- 
lighted the effort required to install the screen apparatus and to calibrate it along with 
a tracking system. Nonetheless, once in place the setup revealed quite remarkable 
capabilities. Both sides of the stage can support 3D visuals, tracking and multimodal 
feedback without interfering with each other, hence leading to very high peaks of 
Performers and Spectators Transportation. As previously mentioned, interaction is 
potentially extremely varied (wide Interaction Spectrum) and easy to understand 
(high Spectators Awareness), with the caveat that the playing of the IVMI must happen 
in the space between the musician and the audience. But where this setup excels 
are the visibility dimensions; thanks to the semi-transparent screen the physical bod- 
ies of both musician and audience can be completely visible to one another, much 
like the case of a traditional musical performance. The only clear limitation concerns 
Ensemble Potential, for the employment of such a complex projection-based setup 
makes it extremely difficult to immerse more than one musician on stage at a time. 


13.7 Conclusion 


In this chapter, we investigated the scenography of immersive virtual musical instru- 
ments. We first reviewed the constraints of both immersive virtual environments 
and digital musical performances. From these, we derived seven dimensions for the 
design of scenographies of IVMIs. We finally demonstrated how this dimension 
space can be used to analyse past performances and how it can inform the design of 
new ones. 

We also believe that this dimension space may result in an opportunity to improve 
the quality of IVMI scenographies. Scenographers may employ the dimensions to 
intervene in the most critical details of the stage setup, and choose technologies and 
spatial arrangements that make the performance as inclusive as possible, without 
the need to modify the instruments’ metaphor. Moreover, the proposed approach 
to IVMI performance practice has the potential to influence instrument design too. 
For example, when the topology of a venue imposes too many constraints to build 
a proper scenography, the instrument designer may use the dimensions as a set of 
guidelines to adapt the IVMI to the encountered limitations. 

In similar eventualities, the outcome of the dimensional analysis carries important 
design feedback that might extend even beyond the specific stage scenario. Maybe 
the metaphor designed for the IVMI is simply too complex/idiosyncratic to be 
comprehensible to an external audience, whether or not the venue is suitable for 
the showcase of immersive performances; in other cases, it might be the specific 
combination of some of the parts of the instrument’s metaphor that hinders the transition 
from user to performer, for example an interaction technique that is not compatible 
with the chosen visualisation paradigm. So, another scope of the dimensions is to 
preventively foster this kind of analysis, and push the designer to question the nature 
of their musical VE (i.e., instrument or installation?) during the very design phase. 

We can see multiple extensions of our dimension space, which would allow for 
(1) a stronger integration of the various perceptual aspects in a performance and (2) 
refinements in the analysis to handle the complexity of performance scenographies. 
First, the dimensions that we proposed, in particular transportation, tend to focus on 
the visual immersion of both the audience and performers, i.e., the choice of display 
technology. While presence in a virtual environment and the experience of musical 
performances are very strongly impacted by the visual perception, other modalities 
are also essential. Our dimensions could therefore be refined to take into account the 
auditory and haptic transportation and the interactivity for the audience. 

Second, in this chapter, we chose to use the word scenography to describe techni- 
cal design choices. In this regard, a possible refinement of our dimension space would 
be to distinguish between stage setups, which can be informed by the dimensions, 
and the development of the setups into shared musical experiences (i.e., actual IVMI 
scenographies!), which require further discussion, and potentially an even more qual- 
itative analysis approach. However, given a set of choices, the diversity of potential 
implementations remains very high. In fact, the relationship between constraints and 
the outcome of a performance is even more complex and more counter-intuitive than 
what one would expect. Skilled scenographers may carefully pay attention to the 
direct consequences of design decisions across virtuality and music, yet the strong 
entanglement among the constraints (and the different stakeholders) may make any 
prediction quite inconsistent. For example, it is hard to suspect that replicating on 
stage the same exact setup used by the musician to rehearse in studio could be detri- 
mental to the outcome of the performance. Such a design approach would preserve 
the intimacy with the instrument developed by the performer over hours of practice 
(DMI constraint, Sect. 13.2.1) and it would reinforce the level of immersion that is 
achieved on stage (VE constraint, Sect. 13.3.1); yet, it may clash with how the actual 
IVMI lends itself to a live stage realisation, as well as with venue specifics, audiences’ 
expectation and—always present—miscellaneous contingencies. As a consequence, 
the term “scenography” as intended in this work does not equal a predictable expe- 
rience. 